SpaCy: how to load Google news word2vec vectors? - python

I've tried several methods of loading the google news word2vec vectors (https://code.google.com/archive/p/word2vec/):
en_nlp = spacy.load('en',vector=False)
en_nlp.vocab.load_vectors_from_bin_loc('GoogleNews-vectors-negative300.bin')
The above gives:
MemoryError: Error assigning 18446744072820359357 bytes
I've also tried the .gz packed vectors, and loading and saving them with gensim in a new format:
from gensim.models.word2vec import Word2Vec
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('googlenews2.txt')
This file then contains the words and their word vectors on each line.
I tried to load them with:
en_nlp.vocab.load_vectors('googlenews2.txt')
but it returns "0".
What is the correct way to do this?
Update:
I can load my own created file into spacy.
I use a test.txt file with "string 0.0 0.0 ...." on each line, then compress it with bzip2 to test.txt.bz2.
Then I create a spacy compatible binary file:
spacy.vocab.write_binary_vectors('test.txt.bz2', 'test.bin')
That I can load into spacy:
nlp.vocab.load_vectors_from_bin_loc('test.bin')
This works!
However, when I do the same process for the googlenews2.txt, I get the following error:
lib/python3.6/site-packages/spacy/cfile.pyx in spacy.cfile.CFile.read_into (spacy/cfile.cpp:1279)()
OSError:

For spacy 1.x, load Google news vectors into gensim and convert to a new format (each line in .txt contains a single vector: string, vec):
from gensim.models.word2vec import Word2Vec
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.wv.save_word2vec_format('googlenews.txt')
Remove the first line of the .txt:
tail -n +2 googlenews.txt > googlenews.new && mv -f googlenews.new googlenews.txt
Compress the txt as .bz2:
bzip2 googlenews.txt
Create a SpaCy compatible binary file:
spacy.vocab.write_binary_vectors('googlenews.txt.bz2','googlenews.bin')
Move googlenews.bin to /lib/python/site-packages/spacy/data/en_google-1.0.0/vocab/googlenews.bin in your Python environment.
Then load the wordvectors:
import spacy
nlp = spacy.load('en',vectors='en_google')
or load them later:
nlp.vocab.load_vectors_from_bin_loc('googlenews.bin')
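To sanity-check that the vectors are actually being used (just a sketch; the example words are assumed to be in the Google News vocabulary):
doc = nlp('king queen banana')
print(doc[0].vector.shape)        # should be (300,)
print(doc[0].similarity(doc[1]))  # king vs. queen, noticeably higher than king vs. banana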

I know that this question has already been answered, but I am going to offer a simpler solution. This solution will load google news vectors into a blank spacy nlp object.
import gensim
import spacy
# Path to google news vectors
google_news_path = r"path\to\google\news\GoogleNews-vectors-negative300.bin.gz"
# Load google news vecs in gensim
model = gensim.models.KeyedVectors.load_word2vec_format(google_news_path, binary=True)
# Init blank english spacy nlp object
nlp = spacy.blank('en')
# Loop through range of all indexes, get words associated with each index.
# The words in the keys list will correspond to the order of the google embed matrix
keys = []
for idx in range(3000000):
    keys.append(model.index2word[idx])
# Set the vectors for our nlp object to the google news vectors
nlp.vocab.vectors = spacy.vocab.Vectors(data=model.syn0, keys=keys)
>>> nlp.vocab.vectors.shape
(3000000, 300)
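A quick usage sketch with the vectors attached to the blank pipeline (the words are just examples assumed to be in the vocabulary):
doc = nlp('dog cat banana')
print(doc[0].similarity(doc[1]))    # similarity now comes from the Google News vectors
print(nlp.vocab['dog'].vector[:5])  # first few dimensions of the 300-d vector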

I am using spaCy v2.0.10.
Create a SpaCy compatible binary file:
spacy.vocab.write_binary_vectors('googlenews.txt.bz2','googlenews.bin')
I want to highlight that the specific code in the accepted answer is not working now; I encountered an "AttributeError: ..." when I ran it.
This has changed in spaCy v2: write_binary_vectors was removed. According to the spaCy documentation, the current way to do this is as follows:
$ python -m spacy init-model en /path/to/output -v /path/to/vectors.bin.tar.gz
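The output directory can then be loaded like any other model (a sketch; the path is whatever you passed to init-model):
import spacy
nlp = spacy.load('/path/to/output')
print(nlp.vocab.vectors.shape)  # number of vectors and their width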

It is much easier to use the gensim API to download the compressed word2vec model published by Google; it will be stored in /home/"your_username"/gensim-data/word2vec-google-news-300/. Load the vectors and play ball. I have 16GB of RAM, which is more than enough to handle the model.
import gensim.downloader as api
model = api.load("word2vec-google-news-300") # download the model and return as object ready for use
word_vectors = model.wv #load the vectors from the model
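A quick usage sketch with the loaded KeyedVectors (the query words are just examples):
print(model.most_similar('king', topn=3))
print(model.similarity('king', 'queen'))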

Related

understanding what keras and TensorFlow to use in text classification

I was trying to classify my text with TensorFlow and Keras, but every time I tried to use Keras to read my files from my directory it would throw an error that the text-reading features were not available, even though they are included in the documentation.
So I made my own file-reader functionality (see: how to read text files in keras using os.walk and converting to batched dataset), and now that I am trying to vectorize my text with Keras preprocessing, the module is again not available.
As asked in the comments: I was trying to follow the Keras guide at https://keras.io/api/preprocessing/text/ and the example at https://keras.io/examples/nlp/text_classification_from_scratch/, which uses vectorization. I then started digging into using TF to make tokens from the text, and found how to vectorize text here: https://www.tensorflow.org/tutorials/text/word2vec
The problem is that I could not use those functions because I did not understand them very well. My code is as follows:
train_dataset = get_files_from_dir(train_path, batch_size=batch_size, seed=seed)  # returns a batched dataset of (text, label)
text_ds = train_dataset.map(lambda x, y: x)  # get features only (text with no labels)
def vec_maker(text):
    tokens = text.lower().split()
    vocab, index = {}, 1  # start indexing from 1
    vocab['<pad>'] = 0  # add a padding token
    for token in tokens:
        if token not in vocab:
            vocab[token] = index
            index += 1
    self.vocab = vocab
    return text
Now my problem is: how do I map text_ds to the function to make the vectors? If I try passing the dataset as the function argument, it immediately says:
File"/home/kim/Desktop/programs/python/text_processing/prog/text_process.py", line 78, in vec_maker
tokens = text.lower().split()
AttributeError: 'MapDataset' object has no attribute 'lower'
Help and an explanation would be much appreciated.
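For reference, the error occurs because vec_maker receives the whole MapDataset object rather than individual strings. A minimal sketch of what the linked Keras guides do instead, assuming TF 2.6+ where TextVectorization lives under tf.keras.layers (older versions have it under tf.keras.layers.experimental.preprocessing); max_tokens and output_sequence_length are illustrative values, and train_dataset/text_ds are the datasets from the question:
import tensorflow as tf

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=20000,
    output_mode='int',
    output_sequence_length=250)

# adapt() iterates over the text-only dataset to build the vocabulary
vectorize_layer.adapt(text_ds)

# Then map the layer element-wise over the batched (text, label) dataset
train_vectorized = train_dataset.map(lambda text, label: (vectorize_layer(text), label))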

Cannot load Doc2vec object using gensim

I am trying to load a pre-trained Doc2vec model using gensim and use it to map a paragraph to a vector. I am referring to https://github.com/jhlau/doc2vec, and the pre-trained model I downloaded is the English Wikipedia DBOW, which is linked from the same page. However, when I load the Wikipedia Doc2vec model and infer vectors using the following code:
import gensim.models as g
import codecs
model="wiki_sg/word2vec.bin"
test_docs="test_docs.txt"
output_file="test_vectors.txt"
#inference hyper-parameters
start_alpha=0.01
infer_epoch=1000
#load model
test_docs = [x.strip().split() for x in codecs.open(test_docs, "r", "utf-8").readlines()]
m = g.Doc2Vec.load(model)
#infer test vectors
output = open(output_file, "w")
for d in test_docs:
    output.write(" ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n")
output.flush()
output.close()
I get an error:
/Users/zhangji/Desktop/CSE547/Project/NLP/venv/lib/python2.7/site-packages/smart_open/smart_open_lib.py:402: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
Traceback (most recent call last):
  File "/Users/zhangji/Desktop/CSE547/Project/NLP/AbstractMapping.py", line 19, in <module>
    output.write(" ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n")
AttributeError: 'Word2Vec' object has no attribute 'infer_vector'
I know there are a couple of threads regarding the infer_vector issue on Stack Overflow, but none of them resolved my problem. I installed the gensim package using
pip install git+https://github.com/jhlau/gensim
In addition, after looking at the source code of the gensim package, I found that when I use Doc2vec.load(), the Doc2vec class doesn't really have a load() function of its own; since it is a subclass of Word2vec, it calls the super method load() in Word2vec, which makes the model m a Word2vec object. However, infer_vector() is unique to Doc2vec and does not exist in Word2vec, which is why the error occurs. I also tried casting the model m to a Doc2vec, but I got this error:
>>> g.Doc2Vec(m)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/zhangji/Library/Python/2.7/lib/python/site-packages/gensim/models/doc2vec.py", line 599, in __init__
    self.build_vocab(documents, trim_rule=trim_rule)
  File "/Users/zhangji/Library/Python/2.7/lib/python/site-packages/gensim/models/word2vec.py", line 513, in build_vocab
    self.scan_vocab(sentences, trim_rule=trim_rule)  # initial survey
  File "/Users/zhangji/Library/Python/2.7/lib/python/site-packages/gensim/models/doc2vec.py", line 635, in scan_vocab
    for document_no, document in enumerate(documents):
  File "/Users/zhangji/Library/Python/2.7/lib/python/site-packages/gensim/models/word2vec.py", line 1367, in __getitem__
    return vstack([self.syn0[self.vocab[word].index] for word in words])
TypeError: 'int' object is not iterable
In fact, all I want from gensim for now is to convert a paragraph to a vector using a pre-trained model that works well on academic articles. For various reasons I don't want to train the model on my own. I would be really grateful if someone could help me resolve the issue.
Btw, I am using python2.7, and the current gensim version is 0.12.4.
Thanks!
I would avoid using either the 4-year-old nonstandard gensim fork at https://github.com/jhlau/doc2vec, or any 4-year-old saved models that only load with such code.
The Wikipedia DBOW model there is also suspiciously small at 1.4GB. Wikipedia had well over 4 million articles even 4 years ago, and a 300-dimensional Doc2Vec model trained to have doc-vectors for the 4 million articles would be at least 4000000 articles * 300 dimensions * 4 bytes/dimension = 4.8GB in size, not even counting other parts of the model. (So, that download is clearly not the 4.3M doc, 300-dimensional model mentioned in the associated paper – but something that's been truncated in other unclear ways.)
The current gensim version is 3.8.3, released a few weeks ago.
It'd likely take a bit of tinkering, and an overnight or more runtime, to build your own Doc2Vec model using current code and a current Wikipedia dump - but then you'd be on modern supported code, with a modern model that better understands words coming into use in the last 4 years. (And, if you trained a model on a corpus of the exact kind of documents of interest to you – such as academic articles – the vocabulary, word-senses, and match to your own text-preprocessing to be used on later inferred documents will all be better.)
There's a Jupyter notebook example of building a Doc2Vec model from Wikipedia, which is either functional or very close to functional, inside the gensim source tree at:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb
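For orientation, a minimal sketch of training and inference with a current gensim Doc2Vec (corpus_texts is a hypothetical list of pre-tokenized documents; the parameters are illustrative toy values, not the Wikipedia-scale settings from the notebook):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus_texts = [['an', 'academic', 'abstract'], ['another', 'tokenized', 'paragraph']]
documents = [TaggedDocument(words, [i]) for i, words in enumerate(corpus_texts)]

# dm=0 selects the DBOW mode that the Wikipedia model used
model = Doc2Vec(documents, vector_size=300, dm=0, min_count=1, epochs=20, workers=4)
model.save('my_doc2vec.model')

# Inference on an unseen paragraph, preprocessed the same way as the training docs
vec = Doc2Vec.load('my_doc2vec.model').infer_vector(['new', 'paragraph', 'tokens'], epochs=100)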

How do I use the wikipedia dump as a Gensim model?

I am trying to use the English Wikipedia dump (https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2) as my pre-trained word2vec model using Gensim.
from gensim.models.keyedvectors import KeyedVectors
model_path = 'enwiki-latest-pages-articles.xml.bz2'
w2v_model = KeyedVectors.load_word2vec_format(model_path, binary=True)
when I do this, I get
342 with utils.smart_open(fname) as fin:
343 header = utils.to_unicode(fin.readline(), encoding=encoding)
--> 344 vocab_size, vector_size = (int(x) for x in header.split()) # throws for invalid file format
345 if limit:
346 vocab_size = min(vocab_size, limit)
ValueError: invalid literal for int() with base 10: '<mediawiki'
Do I have to re-download or something?
That dump file includes the actual Wikipedia articles in an XML format – no vectors. The load_word2vec_format() methods only load sets-of-vectors that were trained earlier.
Your gensim installation's docs/notebooks directory includes a number of demo Jupyter notebooks you can run. One of those, doc2vec-wikipedia.ipynb, shows training document-vectors based on the Wikipedia articles dump. (It could be adapted fairly easily to train only word-vectors instead.)
You can also view this notebook online at:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb
Note that you'll learn more from these if you run them locally and enable logging at the INFO level. Also, this particular training may take a full day or more to run, and requires a machine with 16GB or more of RAM.
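A rough sketch of that word-vectors adaptation with gensim 3.x (the streaming class and all parameters here are illustrative, not taken from the notebook):
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models import Word2Vec

class WikiSentences(object):
    # Re-iterable stream of tokenized articles; Word2Vec makes several passes over it.
    def __init__(self, dump_path):
        self.corpus = WikiCorpus(dump_path, dictionary={})  # skip building a Dictionary
    def __iter__(self):
        for tokens in self.corpus.get_texts():
            yield tokens

sentences = WikiSentences('enwiki-latest-pages-articles.xml.bz2')
model = Word2Vec(sentences, size=300, window=5, min_count=5, workers=4)
model.wv.save_word2vec_format('enwiki_vectors.bin', binary=True)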

Visualize Gensim Word2vec Embeddings in Tensorboard Projector

I've only seen a few questions that ask this, and none of them have an answer yet, so I thought I might as well try. I've been using gensim's word2vec model to create some vectors. I exported them into text, and tried importing it on tensorflow's live model of the embedding projector. One problem. It didn't work. It told me that the tensors were improperly formatted. So, being a beginner, I thought I would ask some people with more experience about possible solutions.
Equivalent to my code:
import gensim
corpus = [["words","in","sentence","one"],["words","in","sentence","two"]]
model = gensim.models.Word2Vec(iter = 5,size = 64)
model.build_vocab(corpus)
# save memory
vectors = model.wv
del model
vectors.save_word2vec_format("vect.txt",binary = False)
That creates the model, saves the vectors, and then prints the results out nice and pretty in a tab delimited file with values for all of the dimensions. I understand how to do what I'm doing, I just can't figure out what's wrong with the way I put it in tensorflow, as the documentation regarding that is pretty scarce as far as I can tell.
One idea that has been presented to me is implementing the appropriate TensorFlow code myself, but I don't know how to do that; I only know how to import files into the live demo.
Edit: I have a new problem now. The object my vectors are in is not iterable, because gensim apparently decided to make its own data structures that are incompatible with what I'm trying to do.
Ok. Done with that too! Thanks for your help!
What you are describing is possible. What you have to keep in mind is that Tensorboard reads from saved tensorflow binaries which represent your variables on disk.
More information on saving and restoring tensorflow graph and variables here
The main task is therefore to get the embeddings as saved tf variables.
Assumptions:
in the following code embeddings is a python dict {word:np.array (np.shape==[embedding_size])}
python version is 3.5+
used libraries are numpy as np, tensorflow as tf
the directory to store the tf variables is model_dir/
Step 1: Stack the embeddings to get a single np.array
embeddings_vectors = np.stack(list(embeddings.values()), axis=0)
# shape [n_words, embedding_size]
Step 2: Save the tf.Variable on disk
# Create some variables.
emb = tf.Variable(embeddings_vectors, name='word_embeddings')
# Add an op to initialize the variable.
init_op = tf.global_variables_initializer()
# Add ops to save and restore all the variables.
saver = tf.train.Saver()
# Later, launch the model, initialize the variables and save the
# variables to disk.
with tf.Session() as sess:
    sess.run(init_op)
    # Save the variables to disk.
    save_path = saver.save(sess, "model_dir/model.ckpt")
    print("Model saved in path: %s" % save_path)
model_dir should contain files checkpoint, model.ckpt-1.data-00000-of-00001, model.ckpt-1.index, model.ckpt-1.meta
Step 3: Generate a metadata.tsv
To have a beautiful labeled cloud of embeddings, you can provide tensorboard with metadata as Tab-Separated Values (tsv) (cf. here).
import os

words = '\n'.join(list(embeddings.keys()))
with open(os.path.join('model_dir', 'metadata.tsv'), 'w') as f:
    f.write(words)
# .tsv file written in model_dir/metadata.tsv
Step 4: Visualize
Run $ tensorboard --logdir model_dir and open the Projector tab.
To load the metadata, use the Load option in the Projector's left-hand panel and point it at model_dir/metadata.tsv.
As a reminder, some word2vec embedding projections are also available on http://projector.tensorflow.org/
Gensim actually has the official way to do this.
Documentation about it
The above answers didn't work for me. What I found pretty useful was this script (which will be added to gensim in the future). Source:
To transform the data to metadata:
model = gensim.models.Word2Vec.load_word2vec_format(model_path, binary=True)
with open(tensorsfp, 'w+') as tensors:
    with open(metadatafp, 'w+') as metadata:
        for word in model.index2word:
            encoded = word.encode('utf-8')
            metadata.write(encoded + '\n')
            vector_row = '\t'.join(map(str, model[word]))
            tensors.write(vector_row + '\n')
Or follow this gist
gensim provides a script to convert a word2vec model into TF projector files:
python -m gensim.scripts.word2vec2tensor -i ~w2v_model_file -o output_folder
Then, on the projector website, upload the generated tensor and metadata files.

training data format for NLTK punkt

I would like to run NLTK's Punkt to split sentences. There is no trained model available, so I am training a model separately, but I am not sure whether the training data format I am using is correct.
My training data is one sentence per line. I wasn't able to find any documentation about this; only this thread (https://groups.google.com/forum/#!topic/nltk-users/bxIEnmgeCSM) sheds some light on the training data format.
What is the correct training data format for NLTK Punkt sentence tokenizer?
Ah yes, Punkt tokenizer is the magical unsupervised sentence boundary detection. And the author's last name is pretty cool too, Kiss and Strunk (2006). The idea is to use NO annotation to train a sentence boundary detector, hence the input will be ANY sort of plaintext (as long as the encoding is consistent).
To train a new model, simply use:
import nltk.tokenize.punkt
import pickle
import codecs
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
text = codecs.open("someplain.txt","r","utf8").read()
tokenizer.train(text)
out = open("someplain.pk","wb")
pickle.dump(tokenizer, out)
out.close()
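Loading the pickled tokenizer back later is straightforward (file names follow the snippet above):
import pickle
with open("someplain.pk", "rb") as fin:
    tokenizer = pickle.load(fin)
print(tokenizer.tokenize("This is one sentence. And here is another one."))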
To achieve higher precision and allow you to stop training at any time and still save a proper pickle for your tokenizer, do look at this code snippet for training a German sentence tokenizer, https://github.com/alvations/DLTK/blob/master/dltk/tokenize/tokenizer.py :
def train_punktsent(trainfile, modelfile):
    """ Trains an unsupervised NLTK punkt sentence tokenizer. """
    punkt = PunktTrainer()
    try:
        with codecs.open(trainfile, 'r', 'utf8') as fin:
            punkt.train(fin.read(), finalize=False, verbose=False)
    except KeyboardInterrupt:
        print 'KeyboardInterrupt: Stopping the reading of the dump early!'
    ##HACK: Adds abbreviations from rb_tokenizer.
    abbrv_sent = " ".join([i.strip() for i in
                           codecs.open('abbrev.lex', 'r', 'utf8').readlines()])
    abbrv_sent = "Start" + abbrv_sent + "End."
    punkt.train(abbrv_sent, finalize=False, verbose=False)
    # Finalize and outputs trained model.
    punkt.finalize_training(verbose=True)
    model = PunktSentenceTokenizer(punkt.get_params())
    with open(modelfile, mode='wb') as fout:
        pickle.dump(model, fout, protocol=pickle.HIGHEST_PROTOCOL)
    return model
However, do note that the period detection is very sensitive to the Latin full stop, question mark and exclamation mark. If you're going to train a punkt tokenizer for other languages that don't use Latin orthography, you'll need to somehow hack the code to use the appropriate sentence boundary punctuation. If you're using NLTK's implementation of punkt, edit the sent_end_chars variable.
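A sketch of that hack using NLTK's PunktLanguageVars (the extra character added here, the Devanagari danda, is just an example):
from nltk.tokenize.punkt import PunktLanguageVars, PunktTrainer, PunktSentenceTokenizer

class MyLangVars(PunktLanguageVars):
    sent_end_chars = ('.', '?', '!', '\u0964')  # add the Devanagari danda as an example

trainer = PunktTrainer(lang_vars=MyLangVars())
trainer.train(open('someplain.txt', encoding='utf8').read(), finalize=True)
tokenizer = PunktSentenceTokenizer(trainer.get_params(), lang_vars=MyLangVars())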
There are pre-trained models available other than the 'default' English tokenizer using nltk.tokenize.sent_tokenize(). Here they are: https://github.com/evandrix/nltk_data/tree/master/tokenizers/punkt
Edited
Note the pre-trained models are currently not available because the nltk_data github repo listed above has been removed.
