Doc2Vec model Python 3 compatibility

I trained a doc2vec model with Python2 and I would like to use it in Python3.
When I try to load it in Python 3, I get:
Doc2Vec.load('my_doc2vec.pkl')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 0: ordinal not in range(128)
It seems to be related to a pickle compatibility issue, which I tried to solve by doing:
import pickle

with open('my_doc2vec.pkl', 'rb') as inf:
    data = pickle.load(inf)
data.save('my_doc2vec_python3.pkl')
Gensim also saved other files, which I renamed as well so they can be found when calling
de = Doc2Vec.load('my_doc2vec_python3.pkl')
The load() no longer fails with a UnicodeDecodeError, but inference on the loaded model then produces meaningless results.
Re-training it with Gensim in Python 3 isn't easy either: I used this model to create derived data, so I would have to re-run a long and complex pipeline.
How can I make the doc2vec model compatible with Python 3?

Answering my own question: this answer worked for me.
Here are the steps in a bit more detail:
download the gensim source code, e.g. clone it from the repo
in gensim/utils.py, edit the method unpickle to add the encoding parameter (see the sketch after these steps):
return _pickle.loads(f.read(), encoding='latin1')
using Python 3 and the modified gensim, load the model:
de = Doc2Vec.load('my_doc2vec.pkl')
save it:
de.save('my_doc2vec_python3.pkl')
The model should now be loadable in Python 3 with unmodified gensim.
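For orientation, a minimal sketch of what the patched method might look like under Python 3 (the exact body varies by gensim version, and gensim opens files through its own smart_open helper; plain open() is used here to keep the sketch self-contained):

import pickle as _pickle

def unpickle(fname):
    # encoding='latin1' tells Python 3 how to decode the 8-bit byte strings
    # that Python 2 pickled; the default 'ASCII' raises UnicodeDecodeError
    with open(fname, 'rb') as f:
        return _pickle.loads(f.read(), encoding='latin1')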

Related

Can't train MobileNet SSD with custom dataset with TF1 Object Detection API

I'm trying to train a MobileNet SSD on a custom dataset (which has Arabic letters in the labels) with the TensorFlow Object Detection API and TF 1.15.
I converted the data from YOLO format to TFRecords using this repo, and the conversion works normally.
The training starts and runs for about 800-900 steps until a checkpoint is saved; the checkpoint is then loaded to run evaluation before resuming training, and that's when things go wrong. I get this error:
File "/home/mai/anaconda3/envs/tf/lib/python3.6/site-packages/PIL/ImageFont.py", line 128, in getsize
return self.font.getsize(text)
UnicodeEncodeError: 'latin-1' codec can't encode character '\u062d' in position 0: ordinal not in range(256)
[[node map_1/while/PyFunc (defined at /home/mai/anaconda3/envs/tf/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_16/NonMaxSuppressionV5/_2543]]
0 successful operations.
0 derived errors ignored.
The letter that causes the error is the first Arabic character encountered in the eval TFRecord file.
I'm training using models/research/object_detection/model_main.py
I think this is an encoding error, but I can't find the file where I'd need to fix the encoding, since each file keeps calling into more files that don't show the actual code.
Any help fixing the eval issue would be appreciated :D
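For illustration, the failure can be reproduced outside TensorFlow: the detection visualization code measures label text with PIL's ImageFont, and when no TrueType font is found it falls back to PIL's default bitmap font, which encodes text as latin-1 before measuring it. A minimal sketch of just that failure (assuming a Pillow version contemporary with TF 1.x, where getsize still exists):

from PIL import ImageFont

font = ImageFont.load_default()  # bitmap font; latin-1 only
font.getsize('\u062d')  # raises UnicodeEncodeError, as in the traceback above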

UnicodeDecodeError when using Python 2.7 code on Python 3.7 with cPickle

I am trying to use cPickle on a .pkl file constructed from a "parsed" .csv file. The parsing is done using a pre-built Python toolbox, which has recently been ported from Python 2 to Python 3 (https://github.com/GEMScienceTools/gmpe-smtk)
The code I'm using is as follows:
from smtk.parsers.esm_flatfile_parser import ESMFlatfileParser
parser=ESMFlatfileParser.autobuild("Database10","Metadata10","C:/Python37/TestX10","C:/Python37/NorthSea_Inc_SA.csv")
import cPickle
sm_database = cPickle.load(open("C:/Python37/TestX10/metadatafile.pkl","r"))
It returns the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 44: character maps to <undefined>
From what I can gather, I need to specify the encoding of my .pkl file for cPickle to work, but I don't know what encoding was used for the file produced by parsing the .csv file, so I can't currently do so.
Using Sublime Text, I found it is "hexadecimal", but that is not an accepted encoding format in Python 3.7, is it?
If anyone knows how to determine the required encoding, or how to make hexadecimal encoding usable in Python 3.7, their help would be much appreciated.
P.S. The modules used, such as ESMFlatfileParser, are part of a pre-built toolbox. Given this, is there a chance I may need to alter the encoding within that module as well?
The code is opening the file in text mode ('r'), but it should be binary mode ('rb').
From the documentation for pickle.load (emphasis mine):
[The] file can be an on-disk file opened for binary reading, an io.BytesIO object, or any other custom object that meets this interface.
Since the file is being opened in binary mode, there is no need to provide an encoding argument to open(). It may, however, be necessary to provide an encoding argument to pickle.load(). From the same documentation:
Optional keyword arguments are fix_imports, encoding and errors, which are used to control compatibility support for pickle stream generated by Python 2. If fix_imports is true, pickle will try to map the old Python 2 names to the new names used in Python 3. The encoding and errors tell pickle how to decode 8-bit string instances pickled by Python 2; these default to ‘ASCII’ and ‘strict’, respectively. The encoding can be ‘bytes’ to read these 8-bit string instances as bytes objects. Using encoding='latin1' is required for unpickling NumPy arrays and instances of datetime, date and time pickled by Python 2.
This ought to prevent the UnicodeDecodeError:
sm_database = cPickle.load(open("C:/Python37/TestX10/metadatafile.pkl","rb"))
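Putting both pieces together, a minimal sketch; note that in Python 3 there is no separate cPickle module (plain pickle uses the C implementation automatically), and encoding='latin1' is the documented choice when the Python 2 data contains NumPy arrays or datetime objects:

import pickle

with open("C:/Python37/TestX10/metadatafile.pkl", "rb") as f:
    # binary mode, plus latin1 decoding of Python 2 byte strings
    sm_database = pickle.load(f, encoding="latin1")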

How to read a ckpt file with Python 3 when it was saved using Python 2?

I am trying to read a checkpoint file with PyTorch:
checkpoint = torch.load('xxx.ckpt')
The file was generated by a program written in Python 2.7. I try to read the file using Python 3.6 but get the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8c in position 16: ordinal not in range(128)
Is it possible to read the file without downgrading Python?
There are some compatibility issues in pickle between Python 2.x and Python 3.x because of the move to Unicode; you are probably saving a string as part of your model, and that's why you see that error.
You can follow the recommended way of saving a model in PyTorch and do:
torch.save(model.state_dict(), filename)
instead of saving the whole model object. Then in Python 3:
model = Model() # construct a new model
model.load_state_dict(torch.load(filename))
Another way is to unpickle in Python 2 and save the data in a format that is easier to transfer between Python 2 and Python 3. For example, you can export the network's tensors through the PyTorch-NumPy bridge and use np.savez.
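A sketch of that NumPy route, assuming a trained model object exists in the Python 2 session and a Model class to rebuild it in Python 3 (both hypothetical names); the .npz format itself is identical across Python versions:

import numpy as np
import torch

# Python 2: export every tensor in the state dict as a plain NumPy array
state = model.state_dict()
np.savez('weights.npz', **{name: t.cpu().numpy() for name, t in state.items()})

# Python 3: rebuild the state dict from the arrays and load it
arrays = np.load('weights.npz')
model = Model()  # hypothetical stand-in for your architecture class
model.load_state_dict({name: torch.from_numpy(arrays[name]) for name in arrays.files})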
You can also try using pickle directly instead of torch.load, telling it how to decode the ASCII strings pickled by Python 2 into Python 3 strings.
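Depending on the PyTorch version, there is a shortcut along those lines: torch.load forwards extra keyword arguments to the pickle module it uses, so the decoding can be requested directly (a sketch; check your installed version's documentation):

import torch

# extra keyword arguments are passed through to pickle.Unpickler
checkpoint = torch.load('xxx.ckpt', encoding='latin1')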
Eventually I solved the issue by:
1) creating a Python 2 environment using Anaconda
2) reading the checkpoint file with PyTorch, then saving it with pickle:
import pickle
import torch

checkpoint = torch.load("xxx.ckpt")
with open("xxx.pkl", "wb") as outfile:
    pickle.dump(checkpoint, outfile)
3) back in the Python 3 environment, reading the file with pickle and saving it again with PyTorch:
import pickle
import torch

with open("xxx.pkl", "rb") as pkl_file:
    data = pickle.load(pkl_file, encoding="latin1")
torch.save(data, "xxx.ckpt")

UnicodeDecodeError error when loading word2vec

Full Description
I am starting to work with word embeddings and have found a great amount of information about them. So far, I understand that I can either train my own word vectors or use previously trained ones, such as Google's or Wikipedia's; those are available for English and aren't useful to me, since I work with texts in Brazilian Portuguese. I therefore went hunting for pre-trained word vectors in Portuguese and ended up finding Hirosan's List of Pretrained Word Embeddings, which led me to Kyubyong's WordVectors, from which I learned about Rami Al-Rfou's Polyglot. After downloading both, I have been trying, unsuccessfully, to simply load the word vectors.
Short Description
I can't load pre-trained word vectors; I am trying WordVectors and Polyglot.
Downloads
Kyubyong's pre-trained word2vec-format word vectors for Portuguese;
Polyglot's pre-trained word vectors for Portuguese;
Loading attempts
Kyubyong's WordVectors
First attempt: using Gensim, as suggested by Hirosan:
from gensim.models import KeyedVectors
kyu_path = '.../pre-trained_word_vectors/kyubyong_pt/pt.bin'
word_vectors = KeyedVectors.load_word2vec_format(kyu_path, binary=True)
And the error returned:
[...]
File "/Users/luisflavio/anaconda3/lib/python3.6/site-packages/gensim/utils.py", line 359, in any2unicode
return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
The downloaded zip also contains other files, but all of them return similar errors.
Polyglot
First attempt: following Al-Rfou's instructions:
import pickle
import numpy
pol_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
words, embeddings = pickle.load(open(pol_path, 'rb'))
And the error returned:
File "/Users/luisflavio/Desktop/Python/w2v_loading_tries.py", line 14, in <module>
words, embeddings = pickle.load(open(polyglot_path, "rb"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd4 in position 1: ordinal not in range(128)
Second attempt: using Polyglot's word embedding load function:
First, we have to install polyglot via pip:
pip install polyglot
Now we can import it:
from polyglot.mapping import Embedding
pol_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
embeddings = Embedding.load(pol_path)
And the error returned:
File "/Users/luisflavio/anaconda3/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Extra Information
I am using Python 3 on macOS High Sierra.
Solutions
Kyubyong's WordVectors
As pointed out by Aneesh Joshi, the correct way to load Kyubyong's model is by calling the native load function of Word2Vec.
from gensim.models import Word2Vec
kyu_path = '.../pre-trained_word_vectors/kyubyong_pt/pt.bin'
model = Word2Vec.load(kyu_path)
Even though I am more than grateful for Aneesh Joshi's solution, Polyglot seems to be a better model for working with Portuguese. Any ideas about that one?
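One untested idea for the Polyglot file: the UnicodeDecodeError above is the classic symptom of a pickle written by Python 2, seen elsewhere on this page, so the usual latin1 trick may apply (a sketch, not a confirmed fix):

import pickle

pol_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
with open(pol_path, 'rb') as f:
    # latin1 decodes the 8-bit strings that Python 2 pickled
    words, embeddings = pickle.load(f, encoding='latin1')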
For Kyubyong's pre-trained word2vec .bin file:
it may have been saved using gensim's save function.
"load the model with load(). Not load_word2vec_format (that's for the C-tool compatibility)."
i.e., model = Word2Vec.load(fname)
Let me know if that works.
Reference: Gensim mailing list

Error: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I am trying to do the following Kaggle assignment. I am using the gensim package for word2vec. I am able to create the model and store it to disk, but when I try to load the file back, I get the error below.
-HP-dx2280-MT-GR541AV:~$ python prog_w2v.py
Traceback (most recent call last):
File "prog_w2v.py", line 7, in <module>
models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 579, in load_word2vec_format
header = utils.to_unicode(fin.readline())
File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 190, in any2unicode
return unicode(text, encoding, errors=errors)
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
I found a similar question, but I was unable to solve the problem. My prog_w2v.py is as follows:
import gensim
import time
start = time.time()
models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
end = time.time()
print end-start," seconds"
I am trying to generate the model using the code here. The program takes about half an hour to generate the model, so I am unable to run it many times while debugging.
You are not loading the file correctly. You should use load() instead of load_word2vec_format().
The latter is used when you train a model with the C code and save it in that binary format. However, you did not save the model in the C binary format; you trained it using Python. So you can simply use the following code and it should work:
models = gensim.models.Word2Vec.load('300features_40minwords_10context.txt')
If you save your model with:
model.wv.save(OUTPUT_FILE_PATH + 'word2vec.bin')
then loading it with the load_word2vec_format method will cause this issue. To make it work, you should use:
wiki_model = KeyedVectors.load(OUTPUT_FILE_PATH + 'word2vec.bin')
The same thing happens when you save the model with:
model.wv.save_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)
and then try to load it with the KeyedVectors.load method. In that case, use:
wiki_model = KeyedVectors.load_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)
As per the other answers, knowing how you saved the file is important, because each save method has a matching load method. But you can simply use the flag unicode_errors='ignore' to skip this issue and load the model as you want.
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format(file_path, binary=True, unicode_errors='ignore')
By default, this flag is set to 'strict': unicode_errors='strict'.
According to the documentation, the following is given as the reason why errors like this occur:
unicode_errors : str, optional
default 'strict', is a string suitable to be passed as the errors
argument to the unicode() (Python 2.x) or str() (Python 3.x) function. If your source
file may include word tokens truncated in the middle of a multibyte unicode character
(as is common from the original word2vec.c tool), 'ignore' or 'replace' may help.
All of the above answers are helpful if we can really keep track of how each model was saved. But what if we have a bunch of models to load and want one general method? We can use the above flag to do so.
I have myself experienced instances where I trained multiple models using the original word2vec.c tool; when I tried to load them into gensim, some models loaded successfully while others gave Unicode errors. I have found the above flag to be helpful and convenient.
If you saved your model with save(), you must use load().
load_word2vec_format() is for models generated by Google's original C tool, not for models saved by gensim.
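Summing up with a small sketch of the two matching pairs, assuming an already trained model (file names here are hypothetical):

from gensim.models import Word2Vec, KeyedVectors

# pair 1: gensim's native format
model.save('my_model')
model = Word2Vec.load('my_model')

# pair 2: the C-tool (word2vec.c / Google) format
model.wv.save_word2vec_format('vecs.bin', binary=True)
vecs = KeyedVectors.load_word2vec_format('vecs.bin', binary=True)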
