I'm using LDA with gensim for topic modeling. My data has 23 documents, and I want separate topics/keywords for each document, but gensim is giving topics for the entire set of documents together. How can I get them for individual documents?
dictionary = corpora.Dictionary(doc_clean)
# Converting the list of documents (corpus) into a Document-Term Matrix
# using the dictionary prepared above.
corpus = [dictionary.doc2bow(doc) for doc in doc_clean]
# Creating the object for the LDA model using the gensim library
Lda = gensim.models.ldamodel.LdaModel
# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(corpus, num_topics=3, id2word=dictionary, passes=50)
result = ldamodel.print_topics(num_topics=3, num_words=3)
This is the output I'm getting:
[(0, '0.011*"plex" + 0.010*"game" + 0.009*"racing"'),
(1, '0.008*"app" + 0.008*"live" + 0.007*"share"'),
(2, '0.015*"device" + 0.009*"file" + 0.008*"movie"')]
print_topics() returns a list of topics, together with the words loading onto each topic and their weights.
If you instead want the topic loadings per document, you need to use get_document_topics().
From the gensim documentation:
get_document_topics(bow, minimum_probability=None, minimum_phi_value=None, per_word_topics=False)
Get the topic distribution for the given document.
Parameters:
bow (corpus : list of (int, float)) – The document in BOW format.
minimum_probability (float) – Topics with an assigned probability lower than this threshold will be discarded.
minimum_phi_value (float) – If per_word_topics is True, this represents a lower bound on the term probabilities that are included.
If set to None, a value of 1e-8 is used to prevent 0s.
per_word_topics (bool) – If True, this function will also return two extra lists as explained in the “Returns” section.
Returns:
list of (int, float) – Topic distribution for the whole document. Each element in the list is a pair of a topic’s id, and the probability that was assigned to it.
list of (int, list of (int, float), optional – Most probable topics per word. Each element in the list is a pair of a word’s id, and a list of topics sorted by their relevance to this word. Only returned if per_word_topics was set to True.
list of (int, list of float), optional – Phi relevance values, multiplied by the feature length, for each word-topic combination. Each element in the list is a pair of a word’s id and a list of the phi values between this word and each topic. Only returned if per_word_topics was set to True.
get_term_topics() and get_topic_terms() may also be potentially interesting for you.
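For example, a minimal sketch (reusing the corpus and ldamodel built in your question) that prints the topic distribution for each of the 23 documents, plus the top words of each document's dominant topic, could look like this:
# For each document in the corpus, print its topic distribution.
# minimum_probability=0 keeps every topic, even those with tiny weights.
for doc_id, bow in enumerate(corpus):
    doc_topics = ldamodel.get_document_topics(bow, minimum_probability=0)
    print("Document", doc_id, doc_topics)

    # Show the top words of the document's dominant topic.
    best_topic, best_prob = max(doc_topics, key=lambda pair: pair[1])
    print("  dominant topic:", best_topic,
          "top words:", ldamodel.show_topic(best_topic, topn=3))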
If I understand you correctly, you need to put the entire thing inside a loop and do print_topics():
Your documents example:
doc1 = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc2 = "My mother spends a lot of time driving my brother around to baseball practice."
doc3 = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_set = [doc1, doc2, doc3]
Now your loop must iterate through your doc_set:
for doc in doc_set:
    ##### after all the cleaning steps, doc_clean should hold the cleaned, tokenized version of this single document (a list with one token list) #####
    dictionary = corpora.Dictionary(doc_clean)
    corpus = [dictionary.doc2bow(d) for d in doc_clean]
    ##### set the num_topics you want for each document, I set one for now #####
    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=1, id2word=dictionary, passes=20)
    for topic in ldamodel.print_topics():
        print(topic)
    print('\n')
Sample output:
(0, '0.200*"brocolli" + 0.200*"eat" + 0.200*"good" + 0.133*"brother" + 0.133*"like" + 0.133*"mother"')
(0, '0.097*"brocolli" + 0.097*"eat" + 0.097*"good" + 0.097*"mother" + 0.097*"brother" + 0.065*"lot" + 0.065*"spend" + 0.065*"practic" + 0.065*"around" + 0.065*"basebal"')
(0, '0.060*"drive" + 0.060*"eat" + 0.060*"good" + 0.060*"mother" + 0.060*"brocolli" + 0.060*"brother" + 0.040*"pressur" + 0.040*"health" + 0.040*"caus" + 0.040*"increas"')
Related
I am currently using h2o.ai to perform some NLP. I have a Word2Vec model trained on my corpus and have successfully aggregated a number of records with the "AVERAGE" method. The problem comes in when I want to create features for my DRF model by using this w2v model to create a bag of words for each entry. When I use the aggregate method "NONE", the vectors are returned in a single column containing NaNs where the records begin and end; however, the unknown words in the model are also being mapped to NaN rather than to the unknown-word vector. This stops me from reorganizing the vectors into a bag of words for each record, because the record separation is lost among the extra, unpredictably placed NaNs. Is there a fix for this?
I am currently planning to use the original tokenized list to build an index of the original double-NaN structure that delimits records, and then recombine my vectors based on that. Just wanted to throw this out there to see if anyone else is dealing with this, or if there is some fix in place that I cannot find on the interwebs.
import re
import pandas as pd

DATA = pd.read_sql(sql, conn1)

# Regex cleaning steps: strip punctuation/digits, then collapse whitespace.
steps = [
    (r'[\n\t\’\–\”\“\!~`\"##\$%\^\&\*()_+\{\}|:<>\?\-=\[\]\\;\',./\d]', ' '),
    (r'\s+', ' ')
]
steps = [(re.compile(a), b) for (a, b) in steps]

def do_steps(anarr):
    for pattern, replacement in steps:
        anarr = pattern.sub(replacement, anarr)
    return anarr

DATA.NARR = DATA.NARR.apply(do_steps)
import h2o
from tqdm import tqdm

train_hdata = h2o.H2OFrame(DATA).ascharacter()
train_narr = train_hdata["NARR"]
train_key = train_hdata["KEY"]
train_tokens_narr = train_narr.tokenize(split=' ')
train_vecs = w2v.transform(train_tokens_narr, aggregate_method='NONE')
VECS = train_vecs.as_data_frame()
df = train_tokens_narr.as_data_frame()

# Rows where both the token and the vector are NaN mark record boundaries.
B = (VECS.isnull() & df.isnull())
idx = B[B['C1'] == True].index.tolist()

# Draft loop: rebuild one comma-separated string of vector components per record.
X = []
X.append('')
j = 0
for i in tqdm(range(len(VECS.C1) - 1)):
    if i in idx:
        X[j] = X[j][:-2]
        j += 1
        X.append('')
    else:
        X[j] = X[j] + str(VECS.C1[i])[:6] + ', '
s = pd.DataFrame({"C1": X})
print(s)
The above is the current code looking to take some records and encode them with the word2vec model for a bag of words. The bottom portion is a draft loop that I am using to put the correct vectors with the correct records. Let me know if I need to clarify.
Unfortunately the functionality to distinguish between words that are missing from your dictionary and NAs used to demarcate the start and end of a record is not currently available. I've made a jira ticket here to track the issue. Please feel free to comment or update the ticket.
I have a csv with a single column, each row is a text document. All text has been normalized:
all lowercase
no punctuation
no numbers
no more than one whitespace between words
no tags(xml, html)
I also have an R script which constructs the Document-Term Matrix on these documents and does some machine-learning analysis. I need to convert this to Spark.
The first step is to produce the Document-Term Matrix, where for each term there is the relative frequency count in the document. The problem is that I am getting a different vocabulary size with R compared to the Spark API or Python sklearn (Spark and Python are consistent with each other).
This is the relevant code for R:
library(RJDBC)
library(Matrix)
library(tm)
library(wordcloud)
library(devtools)
library(lsa)
library(data.table)
library(dplyr)
library(lubridate)
corpus <- read.csv(paste(inputDir, "corpus.csv", sep="/"), stringsAsFactors=FALSE)
DescriptionDocuments<-c(corpus$doc_clean)
DescriptionDocuments <- VCorpus(VectorSource(DescriptionDocuments))
DescriptionDocuments.DTM <- DocumentTermMatrix(DescriptionDocuments, control = list(tolower = FALSE,
stopwords = FALSE,
removeNumbers = FALSE,
removePunctuation = FALSE,
stemming=FALSE))
# VOCABULARY SIZE = 83758
This is the relevant code in Spark (1.6.0, Scala 2.10):
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, RegexTokenizer}
var corpus = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "false").load("/path/to/corpus.csv")
// RegexTokenizer splits by default on one or more spaces, which is ok
val rTokenizer = new RegexTokenizer().setInputCol("doc").setOutputCol("words")
val words = rTokenizer.transform(corpus)
val cv = new CountVectorizer().setInputCol("words").setOutputCol("tf")
val cv_model = cv.fit(words)
var dtf = cv_model.transform(words)
// VOCABULARY SIZE = 84290
I've also checked in python sklearn and I got consistent result with spark:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = pd.read_csv("/path/to/corpus.csv")
docs = corpus.loc[:, "doc"].values

def tokenizer(text):
    return text.split()

cv = CountVectorizer(tokenizer=tokenizer, stop_words=None)
dtf = cv.fit_transform(docs)
print(len(cv.vocabulary_))
# VOCABULARY SIZE = 84290
I don't know the R tm package very well, but it seems to me that it should tokenize on whitespace by default. Does anyone have a hint as to why I am getting different vocabulary sizes?
The reason for the difference is a default option within the creation of a document term matrix. If you check ?termFreq you can find the option wordLengths:
An integer vector of length 2. Words shorter than the minimum word
length wordLengths[1] or longer than the maximum word length
wordLengths[2] are discarded. Defaults to c(3, Inf), i.e., a minimum
word length of 3 characters.
The default setting of c(3, Inf) removes all words shorter than 3 characters, like "at", "in", "I", etc.
This default is what causes the difference between tm and Spark / Python.
See the effect of changing the wordLengths setting in the example below.
library(tm)
data("crude")
dtm <- DocumentTermMatrix(crude)
nTerms(dtm)
[1] 1266
dtm2 <- DocumentTermMatrix(crude, control = list(wordLengths = c(1, Inf)))
nTerms(dtm2)
[1] 1305
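On the Python side, if you instead want sklearn (and likewise Spark) to match tm's default, you could drop tokens shorter than three characters. A minimal sketch, assuming docs is the array of documents from the sklearn snippet above, using a token_pattern that mirrors tm's c(3, Inf) default:
from sklearn.feature_extraction.text import CountVectorizer

# Keep only tokens of 3 or more word characters, mimicking tm's
# default wordLengths = c(3, Inf).
cv = CountVectorizer(token_pattern=r"(?u)\b\w{3,}\b", lowercase=False)
dtf = cv.fit_transform(docs)
print(len(cv.vocabulary_))  # should now be much closer to the tm vocabulary size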
I have obtained a dictionary mapping words to their vectors in Python, and I am trying to scatter plot the n most similar words, since TSNE on a huge number of words is taking forever. The best option is to convert the dictionary to a w2v object to deal with it.
I had the same issue and I finally found the solution.
So, I assume that your dictionary looks like mine:
import numpy as np
import gensim

d = {}
d['1'] = np.random.randn(300)
d['2'] = np.random.randn(300)
Basically, the keys are the users' ids and each of them has a vector with shape (300,).
So now, in order to use it as word2vec, I need to first save it to a binary file and then load it with the gensim library:
from numpy import zeros, dtype, float32 as REAL, ascontiguousarray, fromstring
from gensim import utils
m = gensim.models.keyedvectors.Word2VecKeyedVectors(vector_size=300)
m.vocab = d
m.vectors = np.array(list(d.values()))
my_save_word2vec_format(binary=True, fname='train.bin', total_vec=len(d), vocab=m.vocab, vectors=m.vectors)
Where my_save_word2vec_format function is:
def my_save_word2vec_format(fname, vocab, vectors, binary=True, total_vec=2):
    """Store the input-hidden weight matrix in the same format used by the original
    C word2vec-tool, for compatibility.

    Parameters
    ----------
    fname : str
        The file path used to save the vectors in.
    vocab : dict
        The vocabulary of words.
    vectors : numpy.array
        The vectors to be stored.
    binary : bool, optional
        If True, the data will be saved in binary word2vec format, else it will be saved in plain text.
    total_vec : int, optional
        Explicitly specify total number of vectors
        (in case word vectors are appended with document vectors afterwards).

    """
    if not (vocab or vectors):
        raise RuntimeError("no input")
    if total_vec is None:
        total_vec = len(vocab)
    vector_size = vectors.shape[1]
    assert (len(vocab), vector_size) == vectors.shape
    with utils.smart_open(fname, 'wb') as fout:
        print(total_vec, vector_size)
        fout.write(utils.to_utf8("%s %s\n" % (total_vec, vector_size)))
        # store in sorted order: most frequent words at the top
        for word, row in vocab.items():
            if binary:
                row = row.astype(REAL)
                fout.write(utils.to_utf8(word) + b" " + row.tostring())
            else:
                fout.write(utils.to_utf8("%s %s\n" % (word, ' '.join(repr(val) for val in row))))
And then use
m2 = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format('train.bin', binary=True)
to load the model back as word2vec vectors.
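From there the loaded object behaves like any other gensim KeyedVectors instance, so, for example, you can grab the n most similar entries for your scatter plot (here '1' is just one of the example ids defined above):
# Query the reloaded vectors like any other gensim KeyedVectors object.
similar = m2.most_similar('1', topn=5)
print(similar)  # list of (key, cosine similarity) pairs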
If you've calculated the word-vectors with your own code, you may want to write them to a file in a format compatible with Google's original word2vec.c or gensim. You can review the gensim code in KeyedVectors.save_word2vec_format() to see exactly how its vectors are written – it's less than 20 lines of code – and do something similar with your own vectors. See:
https://github.com/RaRe-Technologies/gensim/blob/3d2227d58b10d0493006a3d7e63b98d64e991e60/gensim/models/keyedvectors.py#L130
Then you could re-load vectors that originated with your code and use them almost directly with examples like the one from Jeff Delaney you mention.
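For instance, a minimal sketch of writing the plain-text variant of that format by hand (write_word2vec_text and word_vecs are hypothetical names; word_vecs is assumed to be a dict mapping each word to a sequence of floats, all of the same length):
def write_word2vec_text(fname, word_vecs):
    # Plain-text word2vec format: a "count dim" header line,
    # then one "word v1 v2 ... vn" line per word.
    dim = len(next(iter(word_vecs.values())))
    with open(fname, 'w', encoding='utf8') as fout:
        fout.write("%d %d\n" % (len(word_vecs), dim))
        for word, vec in word_vecs.items():
            fout.write(word + " " + " ".join("%f" % v for v in vec) + "\n")
A file written this way can then be re-loaded with gensim.models.KeyedVectors.load_word2vec_format(fname, binary=False).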
I am applying the LDA method using Gensim to extract keywords from documents.
I can extract topics, and then assign these topics and key words associated to the documents.
I would like to have the ids of these terms (or key words) instead of the terms themselves. I know that corpus[i] extracts a list of pairs [(term_id, term_frequency) ...] for document i, but I can't see how I could use this in my code to extract only the ids and assign them to my results.
My code is as follows:
from itertools import chain
import pandas as pd

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=passes, minimum_probability=0)

# Assigning the topics to the documents in corpus
lda_corpus = ldamodel[corpus]

# Find the threshold; let's set the threshold to be 1/#clusters.
# To prove that the threshold is sane, we average the sum of all probabilities:
scores = list(chain(*[[score for topic_id, score in topic]
                      for topic in [doc for doc in lda_corpus]]))
threshold = sum(scores) / len(scores)
print(threshold)

for t in range(len(topic_tuple)):
    key_words.append([topic_tuple[t][j][0] for j in range(num_words)])
    df_key_words = pd.DataFrame({'key_words': key_words})

    documents_corpus.append([j for i, j in zip(lda_corpus, doc_set) if i[t][1] > threshold])
    df_documents_corpus = pd.DataFrame({'documents_corpus': documents_corpus})

    documents_corpus_id.append([i for d, i in zip(lda_corpus, doc_set_id) if d[t][1] > threshold])
    df_documents_corpus_id = pd.DataFrame({'documents_corpus_id': documents_corpus_id})

    result.append(pd.concat([df_key_words, df_documents_corpus, df_documents_corpus_id], axis=1))
Thank you in advance, and ask me if more information is needed.
In case someone has the same issue that I had, here is the answer for a reverse map :
reverse_map = dict((ldamodel.id2word[id],id) for id in ldamodel.id2word)
Thanks to bigdeeperadvisors
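As a quick illustration, a sketch of how this reverse map could be used to turn the extracted key words back into their ids (key_words is assumed to be the list of word lists built in the question's loop):
# Map each extracted key word back to its integer id in the dictionary.
key_word_ids = [[reverse_map[word] for word in words] for words in key_words]
print(key_word_ids)
Note that if you still have the dictionary used to build the model, gensim's Dictionary also exposes the same word-to-id mapping directly as dictionary.token2id.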
I am working on a keyword extraction problem. Consider the very general case:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
t = """Two Travellers, walking in the noonday sun, sought the shade of a widespreading tree to rest. As they lay looking up among the pleasant leaves, they saw that it was a Plane Tree.
"How useless is the Plane!" said one of them. "It bears no fruit whatever, and only serves to litter the ground with leaves."
"Ungrateful creatures!" said a voice from the Plane Tree. "You lie here in my cooling shade, and yet you say I am useless! Thus ungratefully, O Jupiter, do men receive their blessings!"
Our best blessings are often the least appreciated."""
tfs = tfidf.fit_transform(t.split(" "))
str = 'tree cat travellers fruit jupiter'
response = tfidf.transform([str])
feature_names = tfidf.get_feature_names()
for col in response.nonzero()[1]:
print(feature_names[col], ' - ', response[0, col])
and this gives me
(0, 28) 0.443509712811
(0, 27) 0.517461475101
(0, 8) 0.517461475101
(0, 6) 0.517461475101
tree - 0.443509712811
travellers - 0.517461475101
jupiter - 0.517461475101
fruit - 0.517461475101
which is good. For any new document that comes in, is there a way to get the top n terms with the highest tfidf score?
You have to do a little bit of a song and dance to get the matrices as numpy arrays instead, but this should do what you're looking for:
import numpy as np

feature_array = np.array(tfidf.get_feature_names())
tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]
n = 3
top_n = feature_array[tfidf_sorting][:n]
This gives me:
array([u'fruit', u'travellers', u'jupiter'],
dtype='<U13')
The argsort call is really the useful one; here are the docs for it. We have to do [::-1] because argsort only supports sorting from small to large. We call flatten to reduce the dimensions to 1d so that the sorted indices can be used to index the 1d feature array. Note that including the call to flatten will only work if you're testing one document at a time.
Also, on another note, did you mean something like tfs = tfidf.fit_transform(t.split("\n\n"))? Otherwise, each term in the multiline string is being treated as a "document". Using \n\n instead means that we are actually looking at 4 documents (one per paragraph), which makes more sense when you think about tfidf.
A solution using the sparse matrix itself (without .toarray())!
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
corpus = [
'I would like to check this document',
'How about one more document',
'Aim is to capture the key words from the corpus',
'frequency of words in a document is called term frequency'
]
X = tfidf.fit_transform(corpus)
feature_names = np.array(tfidf.get_feature_names())
new_doc = ['can key words in this new document be identified?',
'idf is the inverse document frequency caculcated for each of the words']
responses = tfidf.transform(new_doc)
def get_top_tf_idf_words(response, top_n=2):
sorted_nzs = np.argsort(response.data)[:-(top_n+1):-1]
return feature_names[response.indices[sorted_nzs]]
print([get_top_tf_idf_words(response,2) for response in responses])
# [array(['key', 'words'], dtype='<U9'),
#  array(['frequency', 'words'], dtype='<U9')]
Here is some quick code for that (documents is a list):
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tfidf_top_features(documents, n_top=10):
    tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
    tfidf = tfidf_vectorizer.fit_transform(documents)
    # Rank features by their total tf-idf weight across all documents.
    importance = np.argsort(np.asarray(tfidf.sum(axis=0)).ravel())[::-1]
    tfidf_feature_names = np.array(tfidf_vectorizer.get_feature_names())
    return tfidf_feature_names[importance[:n_top]]
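A quick usage sketch, reusing the small example corpus from the previous answer (note that with min_df=2 only terms appearing in at least two documents survive):
corpus = [
    'I would like to check this document',
    'How about one more document',
    'Aim is to capture the key words from the corpus',
    'frequency of words in a document is called term frequency'
]
print(get_tfidf_top_features(corpus, n_top=2))
# On this tiny corpus only 'document' and 'words' pass the min_df=2 filter,
# so those are the two features returned.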