How do I use the wikipedia dump as a Gensim model?

How do I use the wikipedia dump as a Gensim model? - python

I am trying to use the English Wikipedia dump (https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2) as my pre-trained word2vec model using Gensim.
from gensim.models.keyedvectors import KeyedVectors
model_path = 'enwiki-latest-pages-articles.xml.bz2'
w2v_model = KeyedVectors.load_word2vec_format(model_path, binary=True)
when I do this, I get
342 with utils.smart_open(fname) as fin:
343 header = utils.to_unicode(fin.readline(), encoding=encoding)
--> 344 vocab_size, vector_size = (int(x) for x in header.split()) # throws for invalid file format
345 if limit:
346 vocab_size = min(vocab_size, limit)
ValueError: invalid literal for int() with base 10: '<mediawiki'
Do I have to re-download or something?

That dump file includes the actual Wikipedia articles in an XML format – no vectors. The load_word2vec_format() methods only load sets-of-vectors that were trained earlier.
Your gensim installation's docs/notebooks directory includes a number of demo Jupyter notebooks you can run. One of those, doc2vec-wikipedia.ipynb, shows training document-vectors based on the Wikipedia articles dump. (It could be adapted fairly easily to train only word-vectors instead.)
You can also view this notebook online at:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb
Note that you'll learn more from these if you run them locally, and enable logging at the INFO level. Also, this particular training may take a full day or more to run, and require a machine with 16GB or more or RAM.

Related

How to train a custom keypoint detector for drone pose estimation. Detectron2

Because I couldn't find the answer elsewhere I decided to describe my issue here.
I'm trying to create keypoints detector of the Eachine TrashCan Drone for estimating its pose.
I followed some tutorials. The first one was with the TensorFlow ObjectDetectionAPI and because I couldn't find a solution with it I tried to use detectron2. Everything was fine until I needed to register my own dataset to retrain a model.
I am running the code on Google Colab and using coco-annotator for making annotations (https://github.com/jsbroks/coco-annotator/)
I do not think that I wrongly annotated my dataset, but who knows so I will show in the hyperlink below to present it a bit for you: Picture with annotations made by me
I used that code to registrate the data:
from detectron2.data.datasets import register_coco_instances
register_coco_instances("TrashCan_train", {}, "./TrashCan_train/mask_train.json", "./TrashCan_train")
register_coco_instances("TrashCan_test", {}, "./TrashCan_test/mask_test.json", "./TrashCan_test")
This one is not giving me an error but when I'm trying to begin the training process with that code:
from detectron2.engine import DefaultTrainer
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("TrashCan_train",)
cfg.DATASETS.TEST = ("TrashCan_test")
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = ("detectron2://COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x/137849621/model_final_a6e10b.pkl") # Let training initialize from model zoo
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025 # pick a good LR
cfg.SOLVER.MAX_ITER = 25 # 300 iterations seems good enough for this toy dataset; you may need to train longer for a practical dataset
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128 # faster, and good enough for this toy dataset (default: 512)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1 # liczba klas
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
I end up with this:
WARNING [07/14 14:36:52 d2.data.datasets.coco]:
Category ids in annotations are not in [1, #categories]! We'll apply a mapping for you.
[07/14 14:36:52 d2.data.datasets.coco]: Loaded 5 images in COCO format from ./mask_train/mask_train.json
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-12-f4f5153c62a1> in <module>()
14
15 os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
---> 16 trainer = DefaultTrainer(cfg)
17 trainer.resume_or_load(resume=False)
18 trainer.train()
7 frames
/usr/local/lib/python3.6/dist-packages/detectron2/data/datasets/coco.py in load_coco_json(json_file, image_root, dataset_name, extra_annotation_keys)
183 obj["bbox_mode"] = BoxMode.XYWH_ABS
184 if id_map:
--> 185 obj["category_id"] = id_map[obj["category_id"]]
186 objs.append(obj)
187 record["annotations"] = objs
KeyError: 9
This is where you can download my files:
https://github.com/BrunoKryszkiewicz/EachineTrashcan-keypoints/blob/master/TrashCan_Datasets.zip
and where will you find my Colab notebook:
https://colab.research.google.com/drive/1AlLZxrR64irms9mm-RiZ4DNOs2CaAEZ9?usp=sharing
I created such a small dataset because first of all, I wanted to go through threw the training process. If I will manage to do that I will make my dataset bigger.
Now I'm not even sure if is it possible to retrain a human keypoint model for obtaining keypoints of a drone-like object or any other. I need to say that I'm pretty new on the topic so I am asking for your understanding.
If you know any tutorials for creating a custom (non-human) keypoint detector I will be grateful for any information.
Best regards!

Welcome to stackoverflow!
Right off the bat I can notice some differences between your json file and the way that COCO dataset is described here the official coco website. Some keys are superfluous such as "keypoint colors". However in my experience with detectron2 superfluous keys are ignored and do not pose a problem.
The actual error message you are getting is due to the fact that you have annotations that have category_id that detectron's mapping does not account for.
There is only 1 category in the "categories" part of your mask_test.json file, but you have 2 annotations one with category_id = 9 and another with category_id = 11.
What detectron2 does is, it counts the number of categories in the categories field of the json and if they aren't numbered 1 through n it generates it's own mapping in your case it transforms 11 (your present id) into 1 (both in annotations and categories fields), but has no idea what to do with the annotation that has a category 9. And this is causes the error message on the line 185 obj["category_id"] = id_map[obj["category_id"]] because there simply isn't a mapping from 9 to whatever.
One thing to note is that AS FAR AS I CAN TELL (would be nice to ask how to do this on the github page of detectron2) you cannot train keypoint detection with multiple "classes" of keypoints using detectron2
Anyway the solution to your problem is fairly simple, just have one category with id: 1 in the categories field, and write change category_id of everything in the annotations field to 1. And run your training
EDIT: you will also need to change a few other configs for in the cfg for this to work. Namely TEST.KEYPOINT_OKS_SIGMAS = sigmas_used_for_evaluation_per_keypoint MODEL.ROI_KEYPOINT_HEAD.NUM_KEYPOINTS = number_of_keypoints_in_your_category
And what is more you need to have keypoint_flip_map, keypoint_names and keypoint_connection_rules in the metadata of your dataset.
To set this you can use MetadataCatalog.get('name_of_your_set').set(obj)
where obj will be your metadata described above

Cannot load Doc2vec object using gensim

I am trying to load a pre-trained Doc2vec model using gensim and use it to map a paragraph to a vector. I am referring to https://github.com/jhlau/doc2vec and the pre-trained model I downloaded is the English Wikipedia DBOW, which is also in the same link. However, when I load the Doc2vec model on wikipedia and infer vectors using the following code:
import gensim.models as g
import codecs
model="wiki_sg/word2vec.bin"
test_docs="test_docs.txt"
output_file="test_vectors.txt"
#inference hyper-parameters
start_alpha=0.01
infer_epoch=1000
#load model
test_docs = [x.strip().split() for x in codecs.open(test_docs, "r", "utf-8").readlines()]
m = g.Doc2Vec.load(model)
#infer test vectors
output = open(output_file, "w")
for d in test_docs:
output.write(" ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n")
output.flush()
output.close()
I get an error:
/Users/zhangji/Desktop/CSE547/Project/NLP/venv/lib/python2.7/site-packages/smart_open/smart_open_lib.py:402: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
Traceback (most recent call last):
File "/Users/zhangji/Desktop/CSE547/Project/NLP/AbstractMapping.py", line 19, in <module>
output.write(" ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n")
AttributeError: 'Word2Vec' object has no attribute 'infer_vector'
I know there are couple of threads regarding the infer_vector issue on stack overflow, but none of them resolved my problem. I downloaded the gensim package using
pip install git+https://github.com/jhlau/gensim
In addition, after I looked at the source code in gensim package, I found that when I use Doc2vec.load(), the Doc2vec class doesn't really have a load() function by itself, but since it is a subclass of Word2vec, it calls the super method of load() in Word2vec and then make the model m a Word2vec object. However, the infer_vector() function is unique to Doc2vec and does not exist in Word2vec, and that's why it is causing the error. I also tried casting the model m to a Doc2vec, but I got this error:
>>> g.Doc2Vec(m)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/zhangji/Library/Python/2.7/lib/python/site-packages/gensim/models/doc2vec.py", line 599, in __init__
self.build_vocab(documents, trim_rule=trim_rule)
File "/Users/zhangji/Library/Python/2.7/lib/python/site-packages/gensim/models/word2vec.py", line 513, in build_vocab
self.scan_vocab(sentences, trim_rule=trim_rule) # initial survey
File "/Users/zhangji/Library/Python/2.7/lib/python/site-packages/gensim/models/doc2vec.py", line 635, in scan_vocab
for document_no, document in enumerate(documents):
File "/Users/zhangji/Library/Python/2.7/lib/python/site-packages/gensim/models/word2vec.py", line 1367, in __getitem__
return vstack([self.syn0[self.vocab[word].index] for word in words])
TypeError: 'int' object is not iterable
In fact, all I want with gensim for now is to convert a paragraph to a vector using a pre-trained model that works well on academic articles. For some reasons I don't want to train the models on my own. I would be really grateful if someone can help me resolve the issue.
Btw, I am using python2.7, and the current gensim version is 0.12.4.
Thanks!

I would avoid using either the 4-year-old nonstandard gensim fork at https://github.com/jhlau/doc2vec, or any 4-year-old saved models that only load with such code.
The Wikipedia DBOW model there is also suspiciously small at 1.4GB. Wikipedia had well over 4 million articles even 4 years ago, and a 300-dimensional Doc2Vec model trained to have doc-vectors for the 4 million articles would be at least 4000000 articles * 300 dimensions * 4 bytes/dimension = 4.8GB in size, not even counting other parts of the model. (So, that download is clearly not the 4.3M doc, 300-dimensional model mentioned in the associated paper – but something that's been truncated in other unclear ways.)
The current gensim version is 3.8.3, released a few weeks ago.
It'd likely take a bit of tinkering, and an overnight or more runtime, to build your own Doc2Vec model using current code and a current Wikipedia dump - but then you'd be on modern supported code, with a modern model that better understands words coming into use in the last 4 years. (And, if you trained a model on a corpus of the exact kind of documents of interest to you – such as academic articles – the vocabulary, word-senses, and match to your own text-preprocessing to be used on later inferred documents will all be better.)
There's a Jupyter notebook example of building a Doc2Vec model from Wikipedia that either functional or very-close-to-functional inside the gensim source tree at:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb

Visualize Gensim Word2vec Embeddings in Tensorboard Projector

I've only seen a few questions that ask this, and none of them have an answer yet, so I thought I might as well try. I've been using gensim's word2vec model to create some vectors. I exported them into text, and tried importing it on tensorflow's live model of the embedding projector. One problem. It didn't work. It told me that the tensors were improperly formatted. So, being a beginner, I thought I would ask some people with more experience about possible solutions.
Equivalent to my code:
import gensim
corpus = [["words","in","sentence","one"],["words","in","sentence","two"]]
model = gensim.models.Word2Vec(iter = 5,size = 64)
model.build_vocab(corpus)
# save memory
vectors = model.wv
del model
vectors.save_word2vec_format("vect.txt",binary = False)
That creates the model, saves the vectors, and then prints the results out nice and pretty in a tab delimited file with values for all of the dimensions. I understand how to do what I'm doing, I just can't figure out what's wrong with the way I put it in tensorflow, as the documentation regarding that is pretty scarce as far as I can tell.
One idea that has been presented to me is implementing the appropriate tensorflow code, but I don’t know how to code that, just import files in the live demo.
Edit: I have a new problem now. The object I have my vectors in is non-iterable because gensim apparently decided to make its own data structures that are non-compatible with what I'm trying to do.
Ok. Done with that too! Thanks for your help!

What you are describing is possible. What you have to keep in mind is that Tensorboard reads from saved tensorflow binaries which represent your variables on disk.
More information on saving and restoring tensorflow graph and variables here
The main task is therefore to get the embeddings as saved tf variables.
Assumptions:
in the following code embeddings is a python dict {word:np.array (np.shape==[embedding_size])}
python version is 3.5+
used libraries are numpy as np, tensorflow as tf
the directory to store the tf variables is model_dir/
Step 1: Stack the embeddings to get a single np.array
embeddings_vectors = np.stack(list(embeddings.values(), axis=0))
# shape [n_words, embedding_size]
Step 2: Save the tf.Variable on disk
# Create some variables.
emb = tf.Variable(embeddings_vectors, name='word_embeddings')
# Add an op to initialize the variable.
init_op = tf.global_variables_initializer()
# Add ops to save and restore all the variables.
saver = tf.train.Saver()
# Later, launch the model, initialize the variables and save the
# variables to disk.
with tf.Session() as sess:
sess.run(init_op)
# Save the variables to disk.
save_path = saver.save(sess, "model_dir/model.ckpt")
print("Model saved in path: %s" % save_path)
model_dir should contain files checkpoint, model.ckpt-1.data-00000-of-00001, model.ckpt-1.index, model.ckpt-1.meta
Step 3: Generate a metadata.tsv
To have a beautiful labeled cloud of embeddings, you can provide tensorboard with metadata as Tab-Separated Values (tsv) (cf. here).
words = '\n'.join(list(embeddings.keys()))
with open(os.path.join('model_dir', 'metadata.tsv'), 'w') as f:
f.write(words)
# .tsv file written in model_dir/metadata.tsv
Step 4: Visualize
Run $ tensorboard --logdir model_dir -> Projector.
To load metadata, the magic happens here:
As a reminder, some word2vec embedding projections are also available on http://projector.tensorflow.org/

Gensim actually has the official way to do this.
Documentation about it

The above answers didn't work for me. What I found out pretty useful was this script (will be added to gensim in the future) Source
To transform the data to metadata:
model = gensim.models.Word2Vec.load_word2vec_format(model_path, binary=True)
with open( tensorsfp, 'w+') as tensors:
with open( metadatafp, 'w+') as metadata:
for word in model.index2word:
encoded=word.encode('utf-8')
metadata.write(encoded + '\n')
vector_row = '\t'.join(map(str, model[word]))
tensors.write(vector_row + '\n')
Or follow this gist

the gemsim provide convert method word2vec to tf projector file
python -m gensim.scripts.word2vec2tensor -i ~w2v_model_file -o output_folder
add in projector wesite, upload the metadata

SpaCy: how to load Google news word2vec vectors?

I've tried several methods of loading the google news word2vec vectors (https://code.google.com/archive/p/word2vec/):
en_nlp = spacy.load('en',vector=False)
en_nlp.vocab.load_vectors_from_bin_loc('GoogleNews-vectors-negative300.bin')
The above gives:
MemoryError: Error assigning 18446744072820359357 bytes
I've also tried with the .gz packed vectors; or by loading and saving them with gensim to a new format:
from gensim.models.word2vec import Word2Vec
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('googlenews2.txt')
This file then contains the words and their word vectors on each line.
I tried to load them with:
en_nlp.vocab.load_vectors('googlenews2.txt')
but it returns "0".
What is the correct way to do this?
Update:
I can load my own created file into spacy.
I use a test.txt file with "string 0.0 0.0 ...." on each line. Then zip this txt with .bzip2 to test.txt.bz2.
Then I create a spacy compatible binary file:
spacy.vocab.write_binary_vectors('test.txt.bz2', 'test.bin')
That I can load into spacy:
nlp.vocab.load_vectors_from_bin_loc('test.bin')
This works!
However, when I do the same process for the googlenews2.txt, I get the following error:
lib/python3.6/site-packages/spacy/cfile.pyx in spacy.cfile.CFile.read_into (spacy/cfile.cpp:1279)()
OSError:

For spacy 1.x, load Google news vectors into gensim and convert to a new format (each line in .txt contains a single vector: string, vec):
from gensim.models.word2vec import Word2Vec
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.wv.save_word2vec_format('googlenews.txt')
Remove the first line of the .txt:
tail -n +2 googlenews.txt > googlenews.new && mv -f googlenews.new googlenews.txt
Compress the txt as .bz2:
bzip2 googlenews.txt
Create a SpaCy compatible binary file:
spacy.vocab.write_binary_vectors('googlenews.txt.bz2','googlenews.bin')
Move the googlenews.bin to /lib/python/site-packages/spacy/data/en_google-1.0.0/vocab/googlenews.bin of your python environment.
Then load the wordvectors:
import spacy
nlp = spacy.load('en',vectors='en_google')
or load them after later:
nlp.vocab.load_vectors_from_bin_loc('googlenews.bin')

I know that this question has already been answered, but I am going to offer a simpler solution. This solution will load google news vectors into a blank spacy nlp object.
import gensim
import spacy
# Path to google news vectors
google_news_path = "path\to\google\news\\GoogleNews-vectors-negative300.bin.gz"
# Load google news vecs in gensim
model = gensim.models.KeyedVectors.load_word2vec_format(gn_path, binary=True)
# Init blank english spacy nlp object
nlp = spacy.blank('en')
# Loop through range of all indexes, get words associated with each index.
# The words in the keys list will correspond to the order of the google embed matrix
keys = []
for idx in range(3000000):
keys.append(model.index2word[idx])
# Set the vectors for our nlp object to the google news vectors
nlp.vocab.vectors = spacy.vocab.Vectors(data=model.syn0, keys=keys)
>>> nlp.vocab.vectors.shape
(3000000, 300)

I am using spaCy v2.0.10.
Create a SpaCy compatible binary file:
spacy.vocab.write_binary_vectors('googlenews.txt.bz2','googlenews.bin')
I want to highlight that the specific code in the accepted answer is not working now. I encountered "AttributeError: ..." when I run the code.
This has changed in spaCy v2. write_binary_vectors was removed in v2. From spaCy documentations, the current way to do this is as follows:
$ python -m spacy init-model en /path/to/output -v /path/to/vectors.bin.tar.gz

it is much easier to use the gensim api for dowloading the word2vec compressed model by google, it will be stored in /home/"your_username"/gensim-data/word2vec-google-news-300/ . Load the vectors and play ball. I have 16GB of RAM which is more than enough to handle the model
import gensim.downloader as api
model = api.load("word2vec-google-news-300") # download the model and return as object ready for use
word_vectors = model.wv #load the vectors from the model

training data format for NLTK punkt

I would like to run nltk Punkt to split sentences. There is no training model so I train model separately, but I am not sure if the training data format I am using is correct.
My training data is one sentence per line. I wasn't able to find any documentation about this, only this thread (https://groups.google.com/forum/#!topic/nltk-users/bxIEnmgeCSM) sheds some light about training data format.
What is the correct training data format for NLTK Punkt sentence tokenizer?

Ah yes, Punkt tokenizer is the magical unsupervised sentence boundary detection. And the author's last name is pretty cool too, Kiss and Strunk (2006). The idea is to use NO annotation to train a sentence boundary detector, hence the input will be ANY sort of plaintext (as long as the encoding is consistent).
To train a new model, simply use:
import nltk.tokenize.punkt
import pickle
import codecs
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
text = codecs.open("someplain.txt","r","utf8").read()
tokenizer.train(text)
out = open("someplain.pk","wb")
pickle.dump(tokenizer, out)
out.close()
To achieve higher precision and allow you to stop training at any time and still save a proper pickle for your tokenizer, do look at this code snippet for training a German sentence tokenizer, https://github.com/alvations/DLTK/blob/master/dltk/tokenize/tokenizer.py :
def train_punktsent(trainfile, modelfile):
""" Trains an unsupervised NLTK punkt sentence tokenizer. """
punkt = PunktTrainer()
try:
with codecs.open(trainfile, 'r','utf8') as fin:
punkt.train(fin.read(), finalize=False, verbose=False)
except KeyboardInterrupt:
print 'KeyboardInterrupt: Stopping the reading of the dump early!'
##HACK: Adds abbreviations from rb_tokenizer.
abbrv_sent = " ".join([i.strip() for i in \
codecs.open('abbrev.lex','r','utf8').readlines()])
abbrv_sent = "Start"+abbrv_sent+"End."
punkt.train(abbrv_sent,finalize=False, verbose=False)
# Finalize and outputs trained model.
punkt.finalize_training(verbose=True)
model = PunktSentenceTokenizer(punkt.get_params())
with open(modelfile, mode='wb') as fout:
pickle.dump(model, fout, protocol=pickle.HIGHEST_PROTOCOL)
return model
However do note that the period detection is very sensitive to the latin fullstop, question mark and exclamation mark. If you're going to train a punkt tokenizer for other languages that doesn't use latin orthography, you'll need to somehow hack the code to use the appropriate sentence boundary punctuation. If you're using NLTK's implementation of punkt, edit the sent_end_chars variable.
There are pre-trained models available other than the 'default' English tokenizer using nltk.tokenize.sent_tokenize(). Here they are: https://github.com/evandrix/nltk_data/tree/master/tokenizers/punkt
Edited
Note the pre-trained models are currently not available because the nltk_data github repo listed above has been removed.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How do I use the wikipedia dump as a Gensim model? - python

Related

How to train a custom keypoint detector for drone pose estimation. Detectron2

Cannot load Doc2vec object using gensim

Visualize Gensim Word2vec Embeddings in Tensorboard Projector

SpaCy: how to load Google news word2vec vectors?

training data format for NLTK punkt

Categories

Resources