Doc2vec: How to get document vectors - python

How to get document vectors of two text documents using Doc2vec?
I am new to this, so it would be helpful if someone could point me in the right direction or to a tutorial.
I am using gensim.
doc1=["This is a sentence","This is another sentence"]
documents1=[doc.strip().split(" ") for doc in doc1 ]
model = doc2vec.Doc2Vec(documents1, size = 100, window = 300, min_count = 10, workers=4)
I get
AttributeError: 'list' object has no attribute 'words'
whenever I run this.

If you want to train a Doc2Vec model, your data set needs to contain lists of words (similar to the Word2Vec format) and tags (document ids). It can also contain some additional info (see https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb for more information).
# Import libraries
from gensim.models import doc2vec
from collections import namedtuple
# Load data
doc1 = ["This is a sentence", "This is another sentence"]
# Transform data (you can add more data preprocessing steps)
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(doc1):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))
# Train model (set min_count = 1, if you want the model to work with the provided example data set)
model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 1, workers = 4)
# Get the vectors
model.docvecs[0]
model.docvecs[1]
UPDATE (how to train in epochs):
This example became outdated, so I deleted it. For more information on training in epochs, see this answer or @gojomo's comment.
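As a rough replacement, here is a minimal sketch of explicit multi-epoch training with the current gensim API (this assumes gensim >= 4.0, where size was renamed to vector_size and docvecs to dv; it is not the deleted example):
from gensim.models import doc2vec

doc1 = ["This is a sentence", "This is another sentence"]
docs = [doc2vec.TaggedDocument(words=text.lower().split(), tags=[i]) for i, text in enumerate(doc1)]

model = doc2vec.Doc2Vec(vector_size=100, window=5, min_count=1, workers=4)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=20)

model.dv[0]  # vector of the first document (model.docvecs[0] in older gensim)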

Gensim has been updated: LabeledSentence no longer takes labels; it now takes tags - see the documentation at https://radimrehurek.com/gensim/models/doc2vec.html
However, @bee2502 was right with
docvec = model.docvecs[99]
It will give the vector of the 100th document in the trained model; it works with both integer and string tags.

doc=["This is a sentence","This is another sentence"]
documents=[doc.strip().split(" ") for doc in doc1 ]
model = doc2vec.Doc2Vec(documents, size = 100, window = 300, min_count = 10, workers=4)
I got AttributeError: 'list' object has no attribute 'words' because the input documents to the Doc2vec() was not in correct LabeledSentence format.
I hope this below example will help you understand the format.
documents = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])
More details are here : http://rare-technologies.com/doc2vec-tutorial/
However, I solved the problem by reading the input data from a file using TaggedLineDocument().
File format: one document = one line = one TaggedDocument object.
Words are expected to be already preprocessed and separated by whitespace, tags are constructed automatically from the document line number.
sentences = doc2vec.TaggedLineDocument(file_path)
model = doc2vec.Doc2Vec(sentences, size=100, window=300, min_count=10, workers=4)
To get a document vector:
You can use docvecs. More details here: https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.TaggedDocument
docvec = model.docvecs[99]
where 99 is the id of the document whose vector we want. If the tags are integers (the default when you load with TaggedLineDocument()), use the integer id directly, as I did. If the tags are strings, use "SENT_99" instead. This is similar to Word2Vec.
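For illustration, both forms of lookup look like this (the string tag name is just an example):
docvec = model.docvecs[99]          # integer tags, e.g. from TaggedLineDocument
docvec = model.docvecs['SENT_99']   # string tags, if the documents were tagged that way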

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
documents = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(doc1)]
model = Doc2Vec(documents)  # plus whatever other parameters you need
This should work fine. You need to tag your documents to train a doc2vec model.
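As a follow-up sketch (not part of the answer above), once the model is trained you can look up a training document's vector by its tag or infer a vector for unseen text:
vector_0 = model.docvecs[0]                            # first training document (model.dv[0] in gensim >= 4.0)
new_vector = model.infer_vector("some new sentence".split())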

Related

training a Fasttext model

I want to train a Fasttext model in Python using the "gensim" library. First, I should tokenize each sentence into its words, hence converting each sentence to a list of words. Then this list should be appended to a final list. Therefore, at the end, I will have a nested list containing all tokenized sentences:
word_punctuation_tokenizer = nltk.WordPunctTokenizer()
word_tokenized_corpus = []
for line in open('sentences.txt'):
    new = line.strip()
    new = word_punctuation_tokenizer.tokenize(new)
    if len(new) != 0:
        word_tokenized_corpus.append(new)
Then, the model should be built as the following:
embedding_size = 60
window_size = 40
min_word = 5
down_sampling = 1e-2
ft_model = FastText(word_tokenized_corpus,
                    size=embedding_size,
                    window=window_size,
                    min_count=min_word,
                    sample=down_sampling,
                    sg=1,
                    iter=100)
However, the number of sentences in "word_tokenized_corpus" is very large and the program can't handle it. Is it possible to train the model by giving it each tokenized sentence one at a time, like the following?
for line in open('sentences.txt'):
    new = line.strip()
    new = word_punctuation_tokenizer.tokenize(new)
    if len(new) != 0:
        ft_model = FastText(new,
                            size=embedding_size,
                            window=window_size,
                            min_count=min_word,
                            sample=down_sampling,
                            sg=1,
                            iter=100)
Does this make any difference to the final results? Is it possible to train the model without having to build such a large list and keep it in memory?
Since the volume of the data is very high, it is better to pass the text file to gensim as a corpus_file: a plain-text file with one sentence per line and whitespace-separated tokens (e.g. sentences.cor). Then read it in the following way:
from gensim.test.utils import datapath
corpus_file = datapath('sentences.cor')  # datapath() resolves inside gensim's bundled test data; for your own file, just use its path
As for the next step:
model = FastText(size=embedding_size,
                 window=window_size,
                 min_count=min_word,
                 sample=down_sampling,
                 sg=1,
                 iter=100)
model.build_vocab(corpus_file=corpus_file)
total_words = model.corpus_total_words
model.train(corpus_file=corpus_file, total_words=total_words, epochs=5)
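If you prefer not to create a separate file, another sketch is a restartable iterable that streams sentences from disk without building the full list in memory (gensim iterates the corpus once per epoch, so a plain generator is not enough; parameter names follow the older gensim API used in the question):
class SentenceStream:
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                tokens = line.strip().split()
                if tokens:
                    yield tokens

ft_model = FastText(SentenceStream('sentences.txt'),
                    size=embedding_size,
                    window=window_size,
                    min_count=min_word,
                    sample=down_sampling,
                    sg=1,
                    iter=100)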
If you want to use the official fasttext API instead, here is how you can do it:
root = "path/to/all/the/texts/in/a/single/txt/files.txt"
training_param = {
'ws': window_size,
'minCount': min_word,
'dim': embedding_size,
't': down_sampling,
'epoch': 5,
'seed': 0
}
# for all the parameters: https://fasttext.cc/docs/en/options.html
model = fasttext.train_unsupervised(path, **training_param)
model.save_model("embeddings_300_fr.bin")
The advantages of using the fasttext API are that (1) it is implemented in C++ with a Python wrapper and is multithreaded, so it is much faster than Gensim, and (2) it handles reading the text better. It can also be used directly from the command line.
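For completeness, a short sketch of loading the saved binary back and querying it (the file name is taken from the snippet above):
import fasttext

model = fasttext.load_model("embeddings_300_fr.bin")
vector = model.get_word_vector("sentence")                 # embedding of a single word
neighbours = model.get_nearest_neighbors("sentence", k=5)  # closest words by cosine similarity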

Convert Python dictionary to Word2Vec object

I have obtained a dictionary mapping words to their vectors in Python, and I am trying to scatter plot the n most similar words, since t-SNE on a huge number of words takes forever. The best option is to convert the dictionary to a w2v object to deal with it.
I had the same issue and I finally found the solution.
So, I assume that your dictionary looks like mine:
import numpy as np

d = {}
d['1'] = np.random.randn(300)
d['2'] = np.random.randn(300)
Basically, the keys are the users' ids and each of them has a vector with shape (300,).
So now, in order to use it as word2vec, I first need to save it to a binary file and then load it with the gensim library:
import gensim
from gensim import utils
from numpy import float32 as REAL

m = gensim.models.keyedvectors.Word2VecKeyedVectors(vector_size=300)
m.vocab = d
m.vectors = np.array(list(d.values()))
my_save_word2vec_format(binary=True, fname='train.bin', total_vec=len(d), vocab=m.vocab, vectors=m.vectors)
Where my_save_word2vec_format function is:
def my_save_word2vec_format(fname, vocab, vectors, binary=True, total_vec=2):
    """Store the input-hidden weight matrix in the same format used by the original
    C word2vec tool, for compatibility.

    Parameters
    ----------
    fname : str
        The file path used to save the vectors in.
    vocab : dict
        The vocabulary of words.
    vectors : numpy.array
        The vectors to be stored.
    binary : bool, optional
        If True, the data will be saved in binary word2vec format, else it will be saved in plain text.
    total_vec : int, optional
        Explicitly specify total number of vectors
        (in case word vectors are appended with document vectors afterwards).
    """
    if not (vocab or vectors):
        raise RuntimeError("no input")
    if total_vec is None:
        total_vec = len(vocab)
    vector_size = vectors.shape[1]
    assert (len(vocab), vector_size) == vectors.shape
    with utils.smart_open(fname, 'wb') as fout:
        print(total_vec, vector_size)
        fout.write(utils.to_utf8("%s %s\n" % (total_vec, vector_size)))
        # store in sorted order: most frequent words at the top
        for word, row in vocab.items():
            if binary:
                row = row.astype(REAL)
                fout.write(utils.to_utf8(word) + b" " + row.tostring())
            else:
                fout.write(utils.to_utf8("%s %s\n" % (word, ' '.join(repr(val) for val in row))))
And then, to load the model back as word2vec vectors, use:
m2 = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format('train.bin', binary=True)
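Once loaded, the usual KeyedVectors API is available, e.g. for the "n most similar" lookup from the question (the key '1' assumes the string ids used in the dictionary above):
m2.most_similar('1', topn=10)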
If you've calculated the word-vectors with your own code, you may want to write them to a file in a format compatible with Google's original word2vec.c or gensim. You can review the gensim code in KeyedVectors.save_word2vec_format() to see exactly how its vectors are written – it's less than 20 lines of code – and do something similar for your vectors. See:
https://github.com/RaRe-Technologies/gensim/blob/3d2227d58b10d0493006a3d7e63b98d64e991e60/gensim/models/keyedvectors.py#L130
Then you could re-load vectors that originated with your code and use them almost directly with examples like the one from Jeff Delaney you mention.

Creating features function for further classification in python

I have read a description of how to apply random forest regression here. In this example the authors use the following code to create the features:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = "word",max_features = 5000)
train_data_features = vectorizer.fit_transform(clean_train_reviews)
train_data_features = train_data_features.toarray()
I am thinking of combining several possibilities as features and turning them on and off, and I don't know how to do it.
What I have so far is a class where I can turn features on and off and see whether they help (for example, all unigrams plus the 20 most frequent unigrams; later it could be the 10 most frequent adjectives, or tf-idf). But for now I don't understand how to combine them together.
The code looks like this, and in the function part I am lost (the kind of function I have would replicate what they do in the tutorial, but it doesn't seem to be really helpful the way I do it):
class FeatureGen: # for example, feat = FeatureGen(unigrams=False) creates a feature set without the turned-off feature
    def __init__(self, unigrams=True, unigrams_freq=True):
        self.unigrams = unigrams
        self.unigrams_freq = unigrams_freq

    def get_features(self, input):
        vectorizer = CountVectorizer(analyzer="word", max_features=5000)
        tokens = input["token"]
        if self.unigrams:
            train_data_features = vectorizer.fit_transform(tokens)
        return train_data_features
What should I do to add one more feature possibility, e.g. whether the text contains one of the 10 most frequent words?
if self.unigrams:
    train_data_features = vectorizer.fit_transform(tokens)
if self.unigrams_freq:
    # something else
return features # and this should be a combination somehow
Looks like you need np.hstack.
However, you need each feature array to have one row per training case.
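A minimal sketch of what that combination could look like, assuming tokens is the list of documents from get_features and both blocks are converted to dense arrays (for sparse matrices, scipy.sparse.hstack works the same way):
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# tokens: list of raw text documents, as in the question's get_features
unigram_features = CountVectorizer(analyzer="word", max_features=5000).fit_transform(tokens).toarray()
top10_features = CountVectorizer(analyzer="word", max_features=10).fit_transform(tokens).toarray()

features = np.hstack([unigram_features, top10_features])  # shape: (n_documents, n_unigrams + n_top10)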

Gensim - LDA create a document- topic matrix

I am working on a project where I need to apply topic modelling to a set of documents, and I need to create a matrix:
DT , a D × T matrix, where D is the number of documents and T is the number of topics. DT(ij) contains the number of times a word in document Di has been assigned to topic Tj.
So far I have followed this tut: https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html
I am new to gensim and so far I have:
1. created a document list
2. preprocessed and tokenized the documents
3. used corpora.Dictionary() to create an id -> term dictionary (id2word)
4. converted the tokenized documents into a document-term matrix
5. generated an LDA model
So now I get the topics.
How can I now get the matrix that I mentioned before?
I will be using this matrix to calculate the similarity between two documents on topic t as:
sim(a, b) = 1 - |DT(a, t) - DT(b, t)|
There is an implementation in the pyLDAvis source code which returns lists that may be helpful for building the matrix you are interested in.
Snippet from the _extract_data method in gensim.py:
def _extract_data(topic_model, corpus, dictionary, doc_topic_dists=None):
    ...
    return {'topic_term_dists': topic_term_dists, 'doc_topic_dists': doc_topic_dists,
            'doc_lengths': doc_lengths, 'vocab': vocab, 'term_frequency': term_freqs}
The number of topics for your model will be static. Maybe you're interested in finding the document topic distribution for the T matrix. In that case, the DxT matrix would be doc_lengths x doc_topic_dists.
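Alternatively, here is a sketch of building a dense D x T document-topic matrix straight from a gensim LDA model (this assumes the ldamodel and corpus names from the tutorial linked in the question; note the entries are topic probabilities per document, not raw word-assignment counts):
import numpy as np

doc_topic = np.zeros((len(corpus), ldamodel.num_topics))
for d, bow in enumerate(corpus):
    for topic_id, prob in ldamodel.get_document_topics(bow, minimum_probability=0.0):
        doc_topic[d, topic_id] = prob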
Showing your code would be helpful, but if we were to go off of the example in the tutorial you linked then the model is identified by:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=20)
you could put into your script something like:
model_name = "name_of_my_model"
ldamodel.save(model_name)
Then when you run it, the model files will be saved in the same directory the script is run from.
Then you can get topic probability distribution with:
print(ldamodel[doc_bow])
If you want to compute similarity against this model, you need to build a bag-of-words vector for the query document too, and then take the cosine similarity between the two:
dictionary = corpora.Dictionary.load('dictionary.dict')
corpus = corpora.MmCorpus("corpus.mm")
lda = models.LdaModel.load("name_of_my_model.lda")
index = similarities.MatrixSimilarity(lda[corpus])
index.save("simIndex.index")
docname = "docs/the_doc.txt"
doc = open(docname, 'r').read()
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lda = lda[vec_bow]
sims = index[vec_lda]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)

Nltk naive bayesian classifier memory issue

my first post here!
I have problems using the nltk NaiveBayesClassifier. I have a training set of 7000 items. Each training item has a description of 2 or 3 words and a code. I would like to use the code as the class label and each word of the description as a feature.
An example:
"My name is Obama", 001
...
Training set = {[feature['My']=True,feature['name']=True,feature['is']=True,feature[Obama]=True], 001}
Unfortunately, using this approach, the training procedure NaiveBayesClassifier.train uses up to 3 GB of RAM.
What's wrong in my approach?
Thank you!
def document_features(document): # feature extractor
    document = set(document)
    return dict((w, True) for w in document)

...

words = set()
entries = []
train_set = []
train_length = 2000
readfile = open("atcname.pl", 'r')
t = readfile.readline()
while (t != ""):
    t = t.split("'")
    code = t[0]          # class
    desc = t[1].split()  # description words
    words = words.union(desc)  # update dictionary with the new words in the description
    entries.append((desc, code))
    t = readfile.readline()

train_set = classify.util.apply_features(document_features, entries[:train_length])
classifier = NaiveBayesClassifier.train(train_set) # Training
Use nltk.classify.apply_features which returns an object that acts like a list but does not store all the feature sets in memory.
from nltk.classify import apply_features
More information and an example here.
You are loading the whole file into memory anyway; you will need to use some form of lazy loading that reads the data only as needed.
Consider looking into this
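For instance, a minimal sketch of such lazy loading with a generator, reusing the document_features function and file format from the question (whether this alone is enough depends on how much NaiveBayesClassifier.train keeps internally):
def lazy_train_set(path, limit=2000):
    with open(path, 'r') as f:
        for n, line in enumerate(f):
            if n >= limit:
                break
            parts = line.split("'")
            if len(parts) > 1:
                yield (document_features(parts[1].split()), parts[0])

classifier = NaiveBayesClassifier.train(lazy_train_set("atcname.pl"))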
