Convert Python dictionary to Word2Vec object
I have obtained a dictionary mapping words to their vectors in Python, and I am trying to scatter-plot the n most similar words, since running t-SNE on a huge number of words takes forever. The best option is to convert the dictionary to a w2v object and work with that.
I had the same issue and I finally found the solution.
So, I assume that your dictionary looks like mine:

import numpy as np

d = {}
d['1'] = np.random.randn(300)
d['2'] = np.random.randn(300)
Basically, the keys are the users' ids and each of them has a vector with shape (300,).
So now, in order to use it as word2vec, I need to first save it to a binary file and then load it with the gensim library:
import numpy as np
import gensim
from numpy import zeros, dtype, float32 as REAL, ascontiguousarray, fromstring
from gensim import utils

m = gensim.models.keyedvectors.Word2VecKeyedVectors(vector_size=300)
m.vocab = d
m.vectors = np.array(list(d.values()))
my_save_word2vec_format(binary=True, fname='train.bin', total_vec=len(d), vocab=m.vocab, vectors=m.vectors)
Where my_save_word2vec_format function is:
def my_save_word2vec_format(fname, vocab, vectors, binary=True, total_vec=2):
    """Store the input-hidden weight matrix in the same format used by the original
    C word2vec-tool, for compatibility.

    Parameters
    ----------
    fname : str
        The file path used to save the vectors in.
    vocab : dict
        The vocabulary of words.
    vectors : numpy.array
        The vectors to be stored.
    binary : bool, optional
        If True, the data will be saved in binary word2vec format, else it will be saved in plain text.
    total_vec : int, optional
        Explicitly specify total number of vectors
        (in case word vectors are appended with document vectors afterwards).

    """
    if not (vocab or vectors):
        raise RuntimeError("no input")
    if total_vec is None:
        total_vec = len(vocab)
    vector_size = vectors.shape[1]
    assert (len(vocab), vector_size) == vectors.shape
    with utils.smart_open(fname, 'wb') as fout:
        print(total_vec, vector_size)
        fout.write(utils.to_utf8("%s %s\n" % (total_vec, vector_size)))
        # store in sorted order: most frequent words at the top
        for word, row in vocab.items():
            if binary:
                row = row.astype(REAL)
                fout.write(utils.to_utf8(word) + b" " + row.tostring())
            else:
                fout.write(utils.to_utf8("%s %s\n" % (word, ' '.join(repr(val) for val in row))))
And then use

m2 = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format('train.bin', binary=True)

to load the model as word2vec vectors.
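Once the vectors are loaded, you can pull just the n most similar entries for your scatter plot. A hedged usage sketch (the key '1' comes from the toy dictionary above, not real data):

# query the loaded vectors for the 10 nearest neighbours of key '1'
similar = m2.most_similar('1', topn=10)       # list of (key, cosine similarity) pairs
names = [key for key, sim in similar]
vecs = [m2[key] for key in names]             # vectors you could feed to TSNE for plotting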
If you've calculated the word-vectors with your own code, you may want to write them to a file in a format compatible with Google's original word2vec.c or gensim. You can review the gensim code in KeyedVectors.save_word2vec_format() to see exactly how its vectors are written – it's less than 20 lines of code – and do something similar for your vectors. See:
https://github.com/RaRe-Technologies/gensim/blob/3d2227d58b10d0493006a3d7e63b98d64e991e60/gensim/models/keyedvectors.py#L130
Then you could re-load vectors that originated with your code and use them almost directly with examples like the one from Jeff Delaney you mention.
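For reference, a minimal sketch of that plain-text word2vec format, assuming a dict d of word -> 1-D numpy vector as in the answer above (the file name is illustrative):

# write the "count dimensions" header, then one "word v1 v2 ... vN" line per word
vector_size = len(next(iter(d.values())))
with open('vectors.txt', 'w', encoding='utf-8') as fout:
    fout.write("%d %d\n" % (len(d), vector_size))
    for word, vec in d.items():
        fout.write(word + " " + " ".join(repr(float(x)) for x in vec) + "\n")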
Related
Updating dictionary and TFIDF for new documents using Gensim
My use case is that I have a corpus of documents, and when new docs come in I update the Dictionary and vectorize. The result should be a sparse matrix of TFIDF vectors, which I'm using corpus2csc for. I think I have a solution, but I have seen other answers that suggest my solution is impossible, and I've seen some unexpected behavior. I'm seeking a gut-check on the approach, with some specific questions below.

Overall approach: use Dictionary.doc2bow with allow_update=True, then construct the TFIDF model using that updated Dictionary. Questions and example code below.

from gensim import corpora
from gensim import models
from gensim import matutils

docs = [['definition', 'addition', 'term', 'defined'],
        ['term', 'sweet', 'generally', 'subject'],
        ['gene', 'extent', 'provided', 'gene'],
        ['additional', 'cost', 'gene', 'adequacy'],
        ['initial', 'condition', 'sweet', 'effectiveness']]

gensim_dict = corpora.Dictionary([], prune_at=None)  # start with empty dict because I don't know what the corpus looks like at this point
BoW_corpus0 = [gensim_dict.doc2bow(d, allow_update=True) for d in docs]  # Add to the dictionary by setting allow_update=True

# documentation says: corpus (iterable of iterable of (int, number)) – Input corpus in BoW format
matutils.corpus2csc(corpus=BoW_corpus0)  # Expected. A 16x5 matrix with 19 stored elements
matutils.corpus2csc(corpus=BoW_corpus0, num_terms=len(gensim_dict.keys()), num_docs=len(BoW_corpus0), num_nnz=sum([len(doc) for doc in BoW_corpus0]))  # Also expected for the 'efficient' path

model = models.TfidfModel(BoW_corpus0)
vecs = model[BoW_corpus0]
matutils.corpus2csc(corpus=vecs)  # Unexpected - asks for a BoW format, but this is TransformedCorpus. But ignoring that, it's expected -- 16x5 with 19 stored elements.
matutils.corpus2csc(corpus=vecs, num_terms=len(gensim_dict.keys()), num_docs=vecs.obj.num_docs, num_nnz=vecs.obj.num_nnz)  # Expected

corpus2csc documentation asks for a BoW format, not a TransformedCorpus. Does it accept a TransformedCorpus just by chance?

Now I add new docs:

new_docs = [['provided', 'communication', 'provided', 'writing']]
BoW_corpus1 = [gensim_dict.doc2bow(d, allow_update=True) for d in new_docs]
len(gensim_dict.cfs)  # 18
gensim_dict.num_nnz  # 22
len(BoW_corpus1)  # 1

matutils.corpus2csc(corpus=BoW_corpus1)  # Expected. An 18x1 matrix, with 3 stored elements
matutils.corpus2csc(corpus=BoW_corpus1, num_terms=len(gensim_dict.keys()), num_docs=len(BoW_corpus1), num_nnz=sum([len(doc) for doc in BoW_corpus1]))  # Expected. An 18x1 matrix, with 3 stored elements

# Let's use the original model. Ideally we would update the model with the new document but I'm not sure of the best way to do that.
new_vecs = model[BoW_corpus1]
matutils.corpus2csc(corpus=new_vecs)  # Unexpected. Using the original model, a 10x1 with 1 stored elements. Why not 18x1?

Why does this approach result in a 10x1 matrix?

# let's try using the Dictionary that's been updated on the fly
dict_based_model = models.TfidfModel(dictionary=gensim_dict)
new_vecs2 = dict_based_model[BoW_corpus1]
matutils.corpus2csc(corpus=new_vecs2)  # Expected. 18x1 with 3 stored elements.
matutils.corpus2csc(corpus=new_vecs2, num_terms=len(gensim_dict.keys()), num_docs=len(new_vecs2), num_nnz=sum([len(v) for v in new_vecs2]))  # Also expected. 18x1 with 3 stored elements.

# But why are these not the same?
assert new_vecs2.obj.num_docs == len(new_vecs)
assert new_vecs2.obj.num_nnz == sum([len(v) for v in new_vecs2])

# Finally, let's make a model based on the new corpus, I know this isn't right but curious why the output is what it is
new_corpus_based_model = models.TfidfModel(BoW_corpus1)
new_vecs3 = new_corpus_based_model[BoW_corpus1]
matutils.corpus2csc(corpus=new_vecs3)  # 0x1 with 0 elements. Kinda expected. But I would have thought it would have produced a 2x1 or 3x1 matrix

Can you confirm that dict_based_model is the right approach? What is the new_vecs2.obj all about? Why do I get a 0x1 instead of a 2x1?
How can I print document wise topics in Gensim?
I'm using LDA with gensim for topic modeling. My data has 23 documents and I want separate topics/words for each document, but gensim is giving topics for the entire set of documents together. How do I get them for individual docs?

dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
corpus = [dictionary.doc2bow(doc) for doc in doc_clean]

# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training LDA model on the document term matrix.
ldamodel = Lda(corpus, num_topics=3, id2word=dictionary, passes=50)

result = ldamodel.print_topics(num_topics=3, num_words=3)

This is the output I'm getting:

[(0, '0.011*"plex" + 0.010*"game" + 0.009*"racing"'),
 (1, '0.008*"app" + 0.008*"live" + 0.007*"share"'),
 (2, '0.015*"device" + 0.009*"file" + 0.008*"movie"')]
print_topics() returns a list of topics, the words loading onto each topic, and the weights of those words. If you want the topic loadings per document instead, you need to use get_document_topics(). From the gensim documentation:

get_document_topics(bow, minimum_probability=None, minimum_phi_value=None, per_word_topics=False)

Get the topic distribution for the given document.

Parameters:
bow (corpus : list of (int, float)) – The document in BOW format.
minimum_probability (float) – Topics with an assigned probability lower than this threshold will be discarded.
minimum_phi_value (float) – If per_word_topics is True, this represents a lower bound on the term probabilities that are included. If set to None, a value of 1e-8 is used to prevent 0s.
per_word_topics (bool) – If True, this function will also return two extra lists as explained in the "Returns" section.

Returns:
list of (int, float) – Topic distribution for the whole document. Each element in the list is a pair of a topic's id, and the probability that was assigned to it.
list of (int, list of (int, float)), optional – Most probable topics per word. Each element in the list is a pair of a word's id, and a list of topics sorted by their relevance to this word. Only returned if per_word_topics was set to True.
list of (int, list of float), optional – Phi relevance values, multiplied by the feature length, for each word-topic combination. Each element in the list is a pair of a word's id and a list of the phi values between this word and each topic. Only returned if per_word_topics was set to True.

get_term_topics() and get_topic_terms() may also be potentially interesting for you.
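As a usage illustration of that recommendation, a hedged sketch built on the ldamodel and corpus variables from the question (minimum_probability=0.0 keeps every topic in the output):

# print the topic distribution of every document in the corpus
for doc_id, bow in enumerate(corpus):
    doc_topics = ldamodel.get_document_topics(bow, minimum_probability=0.0)
    print(doc_id, doc_topics)   # e.g. 0 [(0, 0.91), (1, 0.05), (2, 0.04)]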
If I understand you correctly, you need to put the entire thing inside a loop and do print_topics().

Your documents example:

doc1 = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc2 = "My mother spends a lot of time driving my brother around to baseball practice."
doc3 = "Some health experts suggest that driving may cause increased tension and blood pressure."

doc_set = [doc1, doc2, doc3]

Now your loop must iterate through your doc_set:

for i in doc_set:
    ##### after all the cleaning in these steps, append to a list #####
    dictionary = corpora.Dictionary(doc_clean)
    corpus = [dictionary.doc2bow(doc) for doc in doc_clean]
    ##### set the num_topics you want for each document, I set one for now #####
    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=1, id2word=dictionary, passes=20)
    for i in ldamodel.print_topics():
        print(i)
        print('\n')

Sample output:

(0, '0.200*"brocolli" + 0.200*"eat" + 0.200*"good" + 0.133*"brother" + 0.133*"like" + 0.133*"mother"')
(0, '0.097*"brocolli" + 0.097*"eat" + 0.097*"good" + 0.097*"mother" + 0.097*"brother" + 0.065*"lot" + 0.065*"spend" + 0.065*"practic" + 0.065*"around" + 0.065*"basebal"')
(0, '0.060*"drive" + 0.060*"eat" + 0.060*"good" + 0.060*"mother" + 0.060*"brocolli" + 0.060*"brother" + 0.040*"pressur" + 0.040*"health" + 0.040*"caus" + 0.040*"increas"')
How to create my own dataset for keras model.fit() using Tensorflow(python)?
I want to train a simple classification neural network which can classify the data into 2 types, i.e. true or false. I have 29 data along with respective labels available with me. I want to parse this data to form a dataset which can be fed into model.fit() to train the neural network. Please suggest me how can I arrange the data with their respective labels. What to use, whether lists, dictionary, array? There are values of 2 fingerprints separated by '$' sign and whether they match or not (i.e. true or false) is separated by another '$' sign. A Fingerprint has 63 features separated by ','(comma) sign. So, Each line has the data of 2 fingerprints and true/false data. I have below data with me in following format: File Name : thumb_and_index.txt 239,1,255,255,255,255,2,0,130,3,1,105,24,152,0,192,126,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,128,0,192,0,192,0,0,0,0,0,0,0,147,18,19,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,101,22,154,0,240,30,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,128,0,0,0,0,0,0,0,0,0,0,0,0,71,150,212,$true 239,1,255,255,255,255,2,0,130,3,1,82,23,146,0,128,126,0,14,0,6,0,6,0,2,0,0,0,0,0,2,0,2,0,2,0,2,0,2,0,6,128,6,192,14,224,30,255,254,0,0,0,0,0,0,207,91,180,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,81,28,138,0,241,254,128,6,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,128,0,128,2,128,2,192,6,224,6,224,62,0,0,0,0,0,0,0,0,0,0,0,0,13,62,$true 239,1,255,255,255,255,2,0,130,3,1,92,29,147,0,224,0,192,0,192,0,128,0,128,0,128,0,128,0,128,0,128,0,128,0,192,0,192,0,224,0,224,2,240,2,248,6,255,14,76,16,0,0,0,0,19,235,73,181,0,0,0,0,$239,192,255,255,255,255,2,0,130,3,1,0,0,0,0,248,30,240,14,224,0,224,0,128,0,128,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,6,128,14,192,14,252,30,0,0,0,0,0,0,0,0,0,0,0,0,158,46,$false 239,1,255,255,255,255,2,0,130,3,1,0,0,0,0,128,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,128,0,0,0,0,0,0,0,217,85,88,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,90,27,135,0,252,254,224,126,128,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,190,148,$false 239,1,255,255,255,255,2,0,130,3,1,89,22,129,0,129,254,128,254,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,6,0,6,128,14,192,14,224,14,0,0,0,0,0,0,20,20,43,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,91,17,134,0,0,126,0,30,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,6,0,6,0,30,192,62,224,126,224,254,0,0,0,0,0,0,0,0,0,0,0,0,138,217,$true 239,1,255,255,255,255,2,0,130,3,1,71,36,143,0,128,254,0,14,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,2,0,6,80,18,0,0,0,0,153,213,11,95,83,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,94,30,140,0,129,254,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,192,6,0,0,0,0,0,0,0,0,0,0,0,0,54,13,$true 239,1,255,255,255,255,2,0,130,3,1,66,42,135,0,255,254,1,254,0,14,0,6,0,6,0,6,0,6,0,6,0,2,0,2,0,2,0,2,0,2,0,2,0,6,0,6,0,6,0,0,0,0,0,0,225,165,64,152,172,88,0,0,$239,1,255,255,255,255,2,0,130,3,1,62,29,137,0,255,254,249,254,240,6,224,2,224,0,224,0,224,0,224,0,224,0,224,0,224,0,240,0,240,0,240,0,240,0,240,0,240,2,0,0,0,0,0,0,0,0,0,0,0,0,0,98,$true 239,1,255,255,255,255,2,0,130,3,1,83,31,142,0,255,254,128,254,0,30,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,128,2,192,2,192,2,192,2,192,6,0,0,0,0,0,0,146,89,117,12,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,84,14,154,0,0,2,0,2,0,2,0,2,0,2,0,6,0,14,128,30,192,62,255,254,255,254,255,254,255,254,255,254,255,254,255,254,255,254,0,0,0,0,0,0,0,0,0,0,0,0,0,31,$false 
239,1,255,255,255,255,2,0,130,3,1,66,41,135,0,255,254,248,62,128,30,0,14,0,14,0,14,0,14,0,14,0,14,0,6,0,6,0,6,0,14,0,14,0,14,192,14,224,14,0,0,0,0,0,0,105,213,155,107,95,23,0,0,$239,1,255,255,255,255,2,0,130,3,1,61,33,133,0,255,254,255,254,224,62,192,6,192,6,192,6,192,6,192,6,192,6,224,6,224,6,224,6,224,6,224,6,224,6,224,6,224,6,0,0,0,0,0,0,0,0,0,0,0,0,0,62,$false 239,1,255,255,255,255,2,0,130,3,1,88,31,119,0,0,14,0,14,0,6,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,133,59,150,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,97,21,137,0,128,14,0,6,0,2,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,6,0,0,0,0,0,0,0,0,0,0,0,80,147,210,$true 239,1,255,255,255,255,2,0,130,3,1,85,21,137,0,224,14,192,6,192,6,128,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,128,14,192,30,224,126,224,254,0,0,0,0,0,0,79,158,178,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,89,25,134,0,240,6,128,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,128,2,128,2,192,2,192,6,224,6,240,14,240,30,0,0,0,0,0,0,0,0,0,0,0,0,72,31,$true 239,1,255,255,255,255,2,0,130,3,1,90,25,128,0,241,254,0,30,0,6,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,6,0,6,192,14,0,0,0,0,0,0,225,153,189,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,96,12,153,0,192,14,128,6,128,6,128,6,0,6,128,2,128,2,128,2,128,6,128,6,192,14,240,30,255,254,255,254,255,254,255,254,255,254,0,0,0,0,0,0,0,0,0,0,0,0,0,18,$false 239,1,255,255,255,255,2,0,130,3,1,96,22,142,0,255,254,254,14,128,2,128,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,192,2,0,0,0,0,0,0,18,25,100,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,76,24,145,0,224,2,192,0,128,0,128,0,128,0,128,0,128,0,128,0,128,0,224,2,240,126,255,254,255,254,255,254,255,254,255,254,255,254,0,0,0,0,0,0,0,0,0,0,0,0,0,145,$false 239,1,255,255,255,255,2,0,130,3,1,71,33,117,0,129,254,0,30,0,14,0,14,0,6,0,6,0,2,0,2,0,6,0,6,0,6,0,6,0,6,128,14,192,14,240,30,240,254,0,0,0,0,0,0,235,85,221,57,17,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,76,31,112,0,255,254,0,62,0,62,0,62,0,14,0,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,128,14,224,62,0,0,0,0,0,0,0,0,0,0,0,0,30,170,$true 239,1,255,255,255,255,2,0,130,3,1,64,29,117,0,128,30,0,30,0,30,0,14,0,6,0,6,0,6,0,6,0,6,0,14,0,14,0,14,128,30,192,30,224,62,240,254,255,254,0,0,0,0,0,0,99,80,119,149,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,72,18,132,0,128,2,0,0,0,0,128,0,128,0,128,0,128,0,192,2,224,2,240,14,252,14,255,254,255,254,255,254,255,254,255,254,255,254,0,0,0,0,0,0,0,0,0,0,0,0,0,14,$false 239,1,255,255,255,255,2,0,130,3,1,82,16,132,0,255,254,255,254,255,254,240,30,224,14,224,14,192,6,192,6,192,2,192,2,192,2,192,2,192,2,192,2,192,1,224,2,240,6,0,0,0,0,0,0,215,21,0,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,85,23,130,0,240,30,192,14,128,14,128,6,128,2,128,2,128,2,128,2,128,2,128,0,192,0,192,2,192,2,224,2,224,6,240,6,248,30,0,0,0,0,0,0,0,0,0,0,0,0,0,62,$true 239,1,255,255,255,255,2,0,130,3,1,100,28,141,0,255,254,255,254,224,14,192,14,192,6,192,2,128,2,128,2,128,2,0,2,0,2,0,2,0,2,0,6,0,6,0,6,192,14,0,0,0,0,0,0,42,88,87,169,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,95,31,134,0,255,254,240,254,224,0,192,0,192,0,192,0,128,0,128,0,128,0,128,0,128,0,128,0,128,0,128,0,128,0,192,2,192,6,0,0,0,0,0,0,0,0,0,0,0,0,0,182,$true 239,1,255,255,255,255,2,0,130,3,1,88,35,121,0,255,14,240,6,224,7,192,2,192,2,192,2,192,2,192,2,192,2,192,2,192,2,224,2,224,2,224,2,224,2,224,2,224,6,0,0,0,0,0,0,36,81,48,225,153,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,81,43,112,0,252,62,248,14,224,2,192,2,192,2,192,0,192,0,192,0,192,0,192,0,192,0,192,0,224,0,224,2,224,2,224,2,224,6,0,0,0,0,0,0,0,0,0,0,0,0,0,76,$true 
239,1,255,255,255,255,2,0,130,3,1,103,24,144,0,255,254,192,14,192,6,128,2,128,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,6,128,6,128,6,192,30,224,254,0,0,0,0,0,0,19,82,111,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,98,11,149,0,255,2,255,0,252,0,240,0,240,0,240,0,248,0,248,0,248,0,252,0,254,0,254,2,254,30,254,30,254,30,254,30,254,30,0,0,0,0,0,0,0,0,0,0,0,0,0,114,$false 239,1,255,255,255,255,2,0,130,3,1,92,23,123,0,255,254,255,30,252,6,240,2,224,0,192,0,192,0,192,0,224,0,224,0,224,0,224,2,224,2,224,2,224,2,224,6,224,6,0,0,0,0,0,0,35,161,251,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,52,37,125,0,255,254,255,254,224,254,192,30,192,14,128,14,128,14,128,14,128,14,128,14,128,14,128,14,128,6,0,2,0,2,0,2,192,2,0,0,0,0,0,0,0,0,0,0,0,0,0,110,$false 239,1,255,255,255,255,2,0,130,3,1,103,19,143,0,255,254,254,254,0,126,0,126,0,126,0,62,0,62,0,126,0,126,0,126,0,126,0,126,0,126,0,126,0,254,0,254,0,254,0,0,0,0,0,0,38,168,0,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,90,30,141,0,255,254,193,254,128,62,0,6,0,2,0,2,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,6,0,254,0,0,0,0,0,0,0,0,0,0,0,0,53,211,$true 239,1,255,255,255,255,2,0,130,3,1,93,34,137,0,255,254,225,254,192,14,192,2,192,2,192,2,192,2,192,0,192,0,192,0,192,0,192,0,192,0,224,2,224,2,240,6,240,14,0,0,0,0,0,0,101,4,252,164,28,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,88,31,140,0,255,254,192,62,192,14,192,14,0,6,0,6,0,6,0,6,0,2,0,2,0,2,0,2,128,2,128,6,192,6,224,14,240,30,0,0,0,0,0,0,0,0,0,0,0,0,10,97,$true 239,1,255,255,255,255,2,0,130,3,1,57,50,107,0,248,2,248,0,248,0,224,0,224,0,192,0,192,0,192,0,128,0,128,0,128,0,128,0,192,0,192,0,192,0,192,2,224,2,0,0,0,0,0,0,34,10,146,27,176,73,73,82,$239,1,255,255,255,255,2,0,130,3,1,54,42,111,0,255,254,255,254,254,126,252,6,240,2,224,2,224,2,224,0,224,0,224,0,224,0,224,0,224,0,224,0,224,0,192,0,192,0,0,0,0,0,0,0,0,0,0,0,0,0,0,225,$true 239,1,255,255,255,255,2,0,130,3,1,103,18,142,0,241,254,224,254,128,126,128,126,0,62,0,30,0,30,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,0,0,0,0,0,209,21,0,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,103,10,139,0,255,254,255,254,255,254,225,254,192,254,192,254,192,126,128,62,0,30,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,163,$true 239,1,255,255,255,255,2,0,130,3,1,85,21,132,0,248,2,248,2,248,0,240,0,240,0,240,0,240,0,240,0,240,0,240,0,248,0,248,0,252,0,252,0,252,0,254,2,255,6,0,0,0,0,0,0,94,23,110,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,76,26,133,0,129,254,128,62,0,62,0,62,0,62,0,62,0,30,0,30,0,30,0,30,0,30,0,30,0,30,0,30,128,30,192,14,224,14,0,0,0,0,0,0,0,0,0,0,0,0,222,36,$true 239,1,255,255,255,255,2,0,130,3,1,87,28,141,0,255,254,255,254,224,254,224,126,224,126,0,14,0,2,0,2,0,2,0,0,0,0,0,0,0,2,0,2,0,2,0,2,0,2,0,0,0,0,0,0,143,231,78,148,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,89,30,139,0,255,254,248,254,240,30,224,14,224,14,192,6,192,2,128,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,2,0,0,0,0,0,0,0,0,0,0,0,0,26,213,$true 239,1,255,255,255,255,2,0,130,3,1,93,25,136,0,255,254,193,254,0,254,0,62,0,30,0,30,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,0,0,0,0,0,148,210,91,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,95,23,145,0,254,254,252,30,240,2,224,0,224,0,224,0,192,0,192,0,192,0,192,6,192,6,192,6,192,6,192,6,192,6,224,6,224,14,0,0,0,0,0,0,0,0,0,0,0,0,0,30,$false 
239,1,255,255,255,255,2,0,130,3,1,85,27,138,0,255,254,240,126,224,30,192,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,30,0,30,0,30,192,62,224,62,0,0,0,0,0,0,85,17,74,101,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,105,19,144,0,192,254,128,126,0,62,0,30,128,30,128,30,128,14,192,14,192,14,192,14,224,14,224,14,240,14,240,14,248,14,254,30,255,30,0,0,0,0,0,0,0,0,0,0,0,0,0,254,$false
239,1,255,255,255,255,2,0,130,3,1,86,37,116,0,255,254,254,14,252,6,248,2,240,0,240,0,224,0,192,0,192,0,128,0,0,0,0,2,0,2,0,2,0,2,0,6,0,6,0,0,0,0,0,0,94,157,90,28,219,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,99,26,130,0,255,254,248,14,240,2,224,0,192,0,192,0,192,0,128,0,192,0,192,0,192,0,192,0,224,0,240,2,248,6,255,254,255,254,0,0,0,0,0,0,0,0,0,0,0,0,0,213,$true

I have used this code trying to parse the data:

import tensorflow as tf
import os
import array as arr
import numpy as np
import json

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

f = open("thumb_and_index.txt", "r")
dataset = []
if f.mode == 'r':
    contents = f.read()  # list of lines
    lines = contents.splitlines()
    print("No. of lines : " + str(len(lines)))
    for line in lines:
        words = line.split(',')
        mainlist = []
        list = []
        flag = 0
        for word in words:
            print("word : " + word)
            if '$' in word:
                if flag == 1:
                    mainlist.append(list)
                    mainlist.append(word[1:])
                    dataset.append(mainlist)
                else:
                    mainlist.append(list)
                    del list[0:len(list)]
                    list.append(int(word[1:]))
                    flag = flag + 1
            else:
                list.append(int(word))
print(json.dumps(dataset, indent=4))

I want to feed the parsed data into model.fit() using keras in tensorflow (python). Also I want to ask about the neural network: how many layers and nodes should I keep in my neural network? Suggest a starting point.
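For reference, a hedged sketch (not the asker's code, and not from the answers below) of one way to arrange the records as numpy arrays for model.fit(), assuming each line is "fingerprint1$fingerprint2$label" with comma-separated integer features as described above:

import numpy as np

features, labels = [], []
with open("thumb_and_index.txt") as f:
    for line in f:
        fp1, fp2, match = line.strip().split("$")
        # both fingerprint strings end with a trailing comma, so concatenate and drop empty fields
        row = [int(v) for v in (fp1 + fp2).split(",") if v != ""]
        features.append(row)
        labels.append(1 if match.strip() == "true" else 0)

X = np.array(features, dtype=np.float32)   # shape (n_samples, 126): 63 features per fingerprint
y = np.array(labels, dtype=np.float32)     # 1 = match, 0 = no match
# later: model.fit(X, y, ...)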
There are plenty of ways to do that (formatting the data). You can create a 2D matrix for the data that has 62 columns, and another array that holds the labels for this data (X_data, Y_data). You can also use pandas to create dataframes for the data (same as arrays, but better for showing and visualizing the data). Example of reading the text file into a pandas dataframe:

import pandas
df = pandas.read_table('./input/dists.txt', delim_whitespace=True, names=('A', 'B', 'C'))

Split the data into x & y, then fit it in your model.

For the size of the hidden layers in your network, it's well known that the more layers you add the more accurate results you get (without considering overfitting), so that depends on your data. I suggest you start with sequential layers as follows (62 -> 2048 -> 1024 -> 512 -> 128 -> 64 -> sigmoid).
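A hedged Keras sketch of that suggested stack of layers (the 62-wide input and the X_data / Y_data names come from the answer's prose; they are assumptions, not verified against the question's actual feature count):

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(2048, activation='relu', input_shape=(62,)),
    keras.layers.Dense(1024, activation='relu'),
    keras.layers.Dense(512, activation='relu'),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid'),   # binary true/false output
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(X_data, Y_data, epochs=10, batch_size=8)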
The best approach, especially assuming that the dataset is large, is to use the tf.data Dataset API. There's a CSV reader built right in. The dataset API provides all the functionality you need to preprocess the dataset, it provides built-in multi-core processing, and quite a bit more. Once you have the dataset built, Keras will accept it as an input directly, so fit(my_dataset, inputs=... outputs=...). The structure of the dataset API takes a little learning, but it's well worth it. Here's the primary guide with lots of examples: https://www.tensorflow.org/guide/datasets Scroll down to the section on 'Import CSV data' for relevant examples. Here's a nice example of using the dataset API with Keras: How to Properly Combine TensorFlow's Dataset API and Keras?
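As a hedged illustration of that approach for the '$'-separated file in the question (TF 2.x style; the parsing logic, shapes, and file name are assumptions, not taken from the answer):

import tensorflow as tf

def parse_line(line):
    # custom '$' / ',' parsing done in plain Python, wrapped with tf.py_function
    def _parse(text):
        fp1, fp2, match = text.numpy().decode("utf-8").split("$")
        feats = [float(v) for v in (fp1 + fp2).split(",") if v != ""]
        label = 1.0 if match.strip() == "true" else 0.0
        return feats, label
    feats, label = tf.py_function(_parse, [line], [tf.float32, tf.float32])
    feats.set_shape([126])   # 63 features per fingerprint, two fingerprints
    label.set_shape([])
    return feats, label

dataset = tf.data.TextLineDataset("thumb_and_index.txt").map(parse_line).batch(8)
# model.fit(dataset, epochs=10)   # Keras accepts a tf.data.Dataset directly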
applying the Similar function in Gensim.Doc2Vec
I am trying to get the doc2vec function to work in Python 3. I have the following code:

tekstdata = [[index, str(row["StatementOfTargetFiguresAndPoliciesForTheUnderrepresentedGender"])] for index, row in data.iterrows()]

def prep(x):
    low = x.lower()
    return word_tokenize(low)

def cleanMuch(data, clean):
    output = []
    for x, y in data:
        z = clean(y)
        output.append([str(x), z])
    return output

tekstdata = cleanMuch(tekstdata, prep)

def tagdocs(docs):
    output = []
    for x, y in docs:
        output.append(gensim.models.doc2vec.TaggedDocument(y, x))
    return output

tekstdata = tagdocs(tekstdata)
print(tekstdata[100])

vectorModel = gensim.models.doc2vec.Doc2Vec(tekstdata, size=100, window=4, min_count=3, iter=2)

ranks = []
second_ranks = []
for x, y in tekstdata:
    print(x)
    print(y)
    inferred_vector = vectorModel.infer_vector(y)
    sims = vectorModel.docvecs.most_similar([inferred_vector], topn=1001, restrict_vocab=None)
    rank = [docid for docid, sim in sims].index(y)
    ranks.append(rank)

All works, as far as I can understand, until the rank function. The error I get is that the id is not in my list, e.g. the documents I am putting in do not have '10' in the list:

File "C:/Users/Niels Helsø/Documents/github/Speciale/Test/Data prep.py", line 59, in <module>
    rank = [docid for docid, sim in sims].index(y)
ValueError: '10' is not in list

It seems to me that it is the similar function that does not work. The model trains on my data (1000 documents) and builds a vocab which is tagged. The documentation I have mainly used is the gensim documentation and tutorial.

I hope that someone can help. If any additional info is needed please let me know.

best
Niels
If you're getting ValueError: '10' is not in list, you can rely on the fact that '10' is not in the list. So have you looked at the list, to see what is there, and whether it matches what you expect?

It's not clear from your code excerpts that tagdocs() is ever called, and thus unclear what form tekstdata is in when provided to Doc2Vec. The intent is a bit convoluted, and there's nothing to display what the data appears as in its raw, original form.

But perhaps the tags you are supplying to TaggedDocument are not the required list-of-tags, but rather a simple string, which will be interpreted as a list-of-characters. As a result, even if you're supplying a tags of '10', it will be seen as ['1', '0'] – and len(vectorModel.doctags) will be just 10 (for the 10 single-digit strings).

Separate comments on your setup:

1000 documents is pretty small for Doc2Vec, where most published results use tens-of-thousands to millions of documents.
An iter of 10-20 is more common in Doc2Vec work (and even larger values might be helpful with smaller datasets).
infer_vector() often works better with non-default values in its optional parameters, especially a steps that's much larger (20-200) or a starting alpha that's more like the bulk-training default (0.025).
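A hedged sketch of that fix, reusing the question's tagdocs() helper (the only change is wrapping the tag in a list so a tag like '10' stays one tag rather than two characters):

def tagdocs(docs):
    output = []
    for x, y in docs:
        # tags must be a list of tags, not a bare string
        output.append(gensim.models.doc2vec.TaggedDocument(words=y, tags=[str(x)]))
    return output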
TensorFlow, Dataset API and flat_map operation
I'm having difficulties working with the tf.contrib.data.Dataset API and wondered if some of you could help. I wanted to transform the entire skip-gram pre-processing of word2vec into this paradigm to play with the API a little bit; it involves the following operations:

Sequences of tokens are loaded dynamically (to avoid loading the whole dataset in memory at once); say we then start with a Stream (to be understood in Scala's way: not all data is in memory, it is loaded when access is needed) of sequences of tokens: seq_tokens.
From any of these seq_tokens we extract skip-grams with a Python function that returns a list of tuples (token, context).
Select for features the column of tokens and for label the column of contexts.

In pseudo-code, to make it clearer, it would look like below. We should be taking advantage of the framework's parallelism system rather than loading the data ourselves, so I would first load in memory only the indices of the sequences, then load the sequences (inside a map, hence if not all lines are processed synchronously, data is loaded asynchronously and there's no OOM to fear), and apply a function on those sequences of tokens that would create a varying number of skip-grams that need to be flattened. In the end, I would formally end up with data being of shape (#lines = number of skip-grams generated, #columns = 2).

data = range(1:N)
  .map(i => load(i): Seq[String])       // load: Int -> Seq[String] loads dynamically a sequence of tokens (sequences have varying length)
  .flat_map(s => skip_gram(s))          // skip_gram: Seq[String] -> Seq[(String, String)] with varying output length

features = data[0]  // features
labels = data[1]    // labels

I've tried naively to do so with the Dataset API but I'm stuck; I can do something like:

iterator = (
    tf.contrib.data.Dataset.range(N)
    .map(lambda i: tf.py_func(load_data, [i], [tf.int32, tf.int32]))  # (1)
    .flat_map(?)                                                      # (2)
    .make_one_shot_iterator()
)

(1) TensorFlow's not happy here because the loaded sequences have different lengths...
(2) I haven't managed yet to do the skip-gram part... I actually just want to call a Python function that computes a sequence (of variable size) of skip-grams and flatten it so that, if the return type is a matrix, each line should be understood as a new line of the output Dataset.

Thanks a lot if anyone has any idea, and don't hesitate if I forgot to mention useful precisions...
I'm just implementing the same thing; here's how I solved it:

dataset = tf.data.TextLineDataset(filename)

if mode == ModeKeys.TRAIN:
    dataset = dataset.shuffle(buffer_size=batch_size * 100)

dataset = dataset.flat_map(lambda line: string_to_skip_gram(line))
dataset = dataset.batch(batch_size)

In my dataset, I treat every line as standalone, so I'm not worrying about contexts that span multiple lines. I therefore flat-map each line through a function string_to_skip_gram that returns a Dataset of a length that depends on the number of tokens in the line.

string_to_skip_gram turns the line into a series of tokens, represented by IDs (using the method tokenize_str), using tf.py_func:

def string_to_skip_gram(line):
    def handle_line(line):
        token_ids = tokenize_str(line)
        (features, labels) = skip_gram(token_ids)
        return np.array([features, labels], dtype=np.int64)

    res = tf.py_func(handle_line, [line], tf.int64)
    features = res[0]
    labels = res[1]
    return tf.data.Dataset.from_tensor_slices((features, labels))

Finally, skip_gram returns a list of all possible context words and target words:

def skip_gram(token_ids):
    skip_window = 1
    features = []
    labels = []
    context_range = [i for i in range(-skip_window, skip_window + 1) if i != 0]
    for word_index in range(skip_window, len(token_ids) - skip_window):
        for context_word_offset in context_range:
            features.append(token_ids[word_index])
            labels.append(token_ids[word_index + context_word_offset])
    return features, labels

Note that I'm not sampling the context words here; just using all of them.
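A hedged usage sketch in the same TF 1.x style as the answer (it assumes the dataset, batch_size, tokenize_str and skip_gram defined above):

# pull one batch of (center word, context word) pairs from the assembled pipeline
iterator = dataset.make_one_shot_iterator()
center_words, context_words = iterator.get_next()

with tf.Session() as sess:
    f_batch, l_batch = sess.run([center_words, context_words])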