I tried to load a .bin embedding file using gensim but got errors. I tried all the methods provided by gensim but couldn't resolve the error.
Method 1
import gensim.models.keyedvectors as word2vec
model=word2vec.KeyedVectors.load_word2vec_format('Health_2.5reviews.s200.w10.n5.v10.cbow.bin', binary=True, unicode_errors='ignore')
Method 2
from gensim.models import KeyedVectors
filename='Health_2.5reviews.s200.w10.n5.v10.cbow.bin'
model=KeyedVectors.load_word2vec_format(filename, binary=True, unicode_errors='ignore')
Methods 1 and 2 gave the error:
"UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 0: invalid start byte"
Method 3
from gensim.models import Word2Vec
filename='Health_2.5reviews.s200.w10.n5.v10.cbow.bin'
model=Word2Vec.load(filename)
Method 3 gave the error:
UnpicklingError: invalid load key, '\xbc'.
Related
I am trying to load a fastText .bin file to use as word embeddings for the first time. I have this:
KeyedVectors.load_word2vec_format(binary_file_path,
binary=True, encoding='utf-8', unicode_errors='ignore')
I also tried what is described here: https://datascience.stackexchange.com/questions/20071/how-do-i-load-fasttext-pretrained-model-with-gensim
Still the same result.
I have downloaded the .bin file from Kaggle (https://www.kaggle.com/kambarakun/fasttext-pretrained-word-vectors-english)
But still I am having the issue:
'utf8' codec can't decode byte 0xba in position 0: invalid start byte
I want to use only the .bin file and not the .vec file, since loading it takes less time.
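For what it's worth, a Facebook-fastText .bin is not in word2vec format at all, which is why load_word2vec_format trips over its header bytes. A minimal sketch using gensim's dedicated fastText loader, assuming the Kaggle download is Facebook's native binary and a reasonably recent gensim (3.8+):
from gensim.models.fasttext import load_facebook_vectors

# Reads the native fastText binary directly instead of trying to
# interpret it as the word2vec C format.
wv = load_facebook_vectors(binary_file_path)
print(wv['health'])  # fastText can also build vectors for OOV words from subwords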
Full Description
I am starting to work with word embeddings and have found a great amount of information about them. I understand, so far, that I can train my own word vectors or use previously trained ones, such as Google's or Wikipedia's; those are available for English, though, and aren't useful to me, since I am working with texts in Brazilian Portuguese. I therefore went hunting for pre-trained word vectors in Portuguese and ended up finding Hirosan's List of Pretrained Word Embeddings, which led me to Kyubyong's WordVectors, from which I learned about Rami Al-Rfou's Polyglot. After downloading both, I have been trying, unsuccessfully, to simply load the word vectors.
Short Description
I can't load pre-trained word vectors; I am trying WordVectors and Polyglot.
Downloads
Kyubyong's pre-trained word2vec-format word vectors for Portuguese;
Polyglot's pre-trained word vectors for Portuguese;
Loading attempts
Kyubyong's WordVectors
First attempt: using Gensim as suggested by Hirosan;
from gensim.models import KeyedVectors
kyu_path = '.../pre-trained_word_vectors/kyubyong_pt/pt.bin'
word_vectors = KeyedVectors.load_word2vec_format(kyu_path, binary=True)
And the error returned:
[...]
File "/Users/luisflavio/anaconda3/lib/python3.6/site-packages/gensim/utils.py", line 359, in any2unicode
return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
The downloaded zip also contains other files, but all of them return similar errors.
Polyglot
First attempt: following Al-Rfou's instructions;
import pickle
import numpy
pol_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
words, embeddings = pickle.load(open(pol_path, 'rb'))
And the error returned:
File "/Users/luisflavio/Desktop/Python/w2v_loading_tries.py", line 14, in <module>
words, embeddings = pickle.load(open(polyglot_path, "rb"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd4 in position 1: ordinal not in range(128)
Second attempt: using Polyglot's word embedding load function;
First, we have to install polyglot via pip:
pip install polyglot
Now we can import it:
from polyglot.mapping import Embedding
pol_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
embeddings = Embedding.load(pol_path)
And the error returned:
File "/Users/luisflavio/anaconda3/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Extra Information
I am using Python 3 on macOS High Sierra.
Solutions
Kyubyong's WordVectors
As pointed out by Aneesh Joshi, the correct way to load Kyubyong's model is by calling the native load function of Word2Vec.
from gensim.models import Word2Vec
kyu_path = '.../pre-trained_word_vectors/kyubyong_pt/pt.bin'
model = Word2Vec.load(kyu_path)
Even though I am more than grateful for Aneesh Joshi's solution, Polyglot seems to be a better model for working with Portuguese. Any ideas about that one?
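One workaround that may apply to the Polyglot pickle: it was written with Python 2, and Python 3's pickle defaults to ASCII when decoding old byte strings. A minimal sketch using the latin1 trick (the same one the doc2vec fix further down relies on):
import pickle

pol_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
# latin1 maps every byte 0x00-0xff, so Python 2 byte strings and
# numpy arrays unpickle without raising UnicodeDecodeError
with open(pol_path, 'rb') as f:
    words, embeddings = pickle.load(f, encoding='latin1')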
For Kyubyong's pre-trained word2vec .bin file:
it may have been saved using gensim's save function.
"load the model with load(). Not load_word2vec_format (that's for the C-tool compatibility)."
i.e., model = Word2Vec.load(fname)
Let me know if that works.
Reference: Gensim mailing list
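As a quick sanity check after loading (a sketch; 'casa' is just an arbitrary Portuguese token assumed to be in the vocabulary):
from gensim.models import Word2Vec

model = Word2Vec.load('.../pre-trained_word_vectors/kyubyong_pt/pt.bin')
# The vectors live on model.wv and support KeyedVectors-style queries
print(model.wv['casa'].shape)
print(model.wv.most_similar('casa', topn=3))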
I trained a doc2vec model with Python 2 and I would like to use it in Python 3.
When I try to load it in Python 3, I get:
Doc2Vec.load('my_doc2vec.pkl')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 0: ordinal not in range(128)
It seems to be related to a pickle compatibility issue, which I tried to solve by doing:
with open('my_doc2vec.pkl', 'rb') as inf:
data = pickle.load(inf)
data.save('my_doc2vec_python3.pkl')
Gensim saved other files which I renamed as well so they can be found when calling
de = Doc2Vec.load('my_doc2vec_python3.pkl')
The load() no longer fails with UnicodeDecodeError, but inference then produces meaningless results.
I can't easily re-train it using Gensim in Python 3 as I used this model to create derived data from it, so I would have to re-run a long and complex pipeline.
How can I make the doc2vec model compatible with Python 3?
Answering my own question, this answer worked for me.
Here are the steps in a bit more detail:
download the gensim source code, e.g. clone it from the repo
in gensim/utils.py, edit the method unpickle to add the encoding parameter:
return _pickle.loads(f.read(), encoding='latin1')
using Python 3 and the modified gensim, load the model:
de = Doc2Vec.load('my_doc2vec.pkl')
save it:
de.save('my_doc2vec_python3.pkl')
This model should now be loadable in Python 3 with unmodified gensim.
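If you would rather not edit gensim's source, the same fix can be applied as a runtime patch (a sketch; it assumes your gensim version routes model loading through gensim.utils.unpickle, as the version discussed here does):
import pickle
import gensim.utils
from gensim.models.doc2vec import Doc2Vec

def unpickle_latin1(fname):
    # Same job as gensim's own unpickle, but decodes Python 2
    # byte strings as latin1 instead of failing on ASCII
    with open(fname, 'rb') as f:
        return pickle.loads(f.read(), encoding='latin1')

gensim.utils.unpickle = unpickle_latin1  # patch before calling load()

de = Doc2Vec.load('my_doc2vec.pkl')
de.save('my_doc2vec_python3.pkl')  # now loadable with unmodified gensim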
I am trying to do the following Kaggle assignment. I am using the gensim package to use word2vec. I am able to create the model and store it to disk. But when I try to load the file back, I get the error below.
-HP-dx2280-MT-GR541AV:~$ python prog_w2v.py
Traceback (most recent call last):
File "prog_w2v.py", line 7, in <module>
models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 579, in load_word2vec_format
header = utils.to_unicode(fin.readline())
File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 190, in any2unicode
return unicode(text, encoding, errors=errors)
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
I found a similar question, but I was unable to solve the problem. My prog_w2v.py is as below.
import gensim
import time
start = time.time()
models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
end = time.time()
print end-start," seconds"
I am trying to generate the model using code here. The program takes about half an hour to generate the model. Hence I am unable to run it many times to debug it.
You are not loading the file correctly. You should use load() instead of load_word2vec_format().
The latter is used when you train a model using the C code and save it in a binary format. However, you are not saving the model in a binary format, and you are training it using Python. So you can simply use the following code and it should work:
models = gensim.models.Word2Vec.load('300features_40minwords_10context.txt')
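For completeness, a round-trip sketch of the matching pair (the training line is illustrative; sentences stands for your own tokenized corpus, and the parameter names follow recent gensim, where older versions use size instead of vector_size):
from gensim.models import Word2Vec

# sentences: an iterable of token lists, e.g. [['first', 'sentence'], ...]
model = Word2Vec(sentences, vector_size=300, min_count=40, window=10)
model.save('300features_40minwords_10context.txt')        # gensim native format, despite the .txt name
model = Word2Vec.load('300features_40minwords_10context.txt')  # the matching loader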
If you save your model with:
model.wv.save(OUTPUT_FILE_PATH + 'word2vec.bin')
Then loading word2vec with load_word2vec_format method would cause the issue. To make it work, you should use:
wiki_model = KeyedVectors.load(OUTPUT_FILE_PATH + 'word2vec.bin')
The same thing also happens when you save the model with:
model.wv.save_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)
And then, want to load with KeyedVectors.load method. In this case, use:
wiki_model = KeyedVectors.load_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)
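Put together, the two valid pairings look like this (a sketch; model is assumed to be an already-trained Word2Vec instance and the paths are placeholders):
from gensim.models import KeyedVectors

# Pairing 1 -- gensim's native format: save() <-> load()
model.wv.save('word2vec.kv')
wv = KeyedVectors.load('word2vec.kv')

# Pairing 2 -- word2vec C format: save_word2vec_format() <-> load_word2vec_format()
model.wv.save_word2vec_format('word2vec.txt', binary=False)
wv = KeyedVectors.load_word2vec_format('word2vec.txt', binary=False)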
As per the other answers, knowing how the file was saved is important because there are matching ways to load it. But you can simply use the flag unicode_errors='ignore' to skip this issue and load the model as you want.
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format(file_path, binary=True, unicode_errors='ignore')
By default, this flag is set to 'strict': unicode_errors='strict'.
According to the documentation, the following is given as the reason why errors like this occur:
unicode_errors : str, optional
default 'strict', is a string suitable to be passed as the errors
argument to the unicode() (Python 2.x) or str() (Python 3.x) function. If your source
file may include word tokens truncated in the middle of a multibyte unicode character
(as is common from the original word2vec.c tool), 'ignore' or 'replace' may help.
All of the above answers are helpful if we can really keep track of how each model was saved. But what if we have a bunch of models to load and need a general method for them? We can use the above flag to do so.
I have myself hit cases where I train multiple models using the original word2vec.c tool, but when I try to load them into gensim, some models load successfully and some give the unicode errors; I have found the above flag helpful and convenient.
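A general-purpose loader along those lines might look like this (a sketch: it tries gensim's native loader first and falls back to the C-format loader with unicode_errors='ignore'):
import pickle
from gensim.models import KeyedVectors

def load_any(path, binary=True):
    # Native gensim format first; if the file isn't a gensim pickle,
    # fall back to the word2vec C format, skipping broken tokens.
    try:
        return KeyedVectors.load(path)
    except (pickle.UnpicklingError, UnicodeDecodeError, AttributeError):
        return KeyedVectors.load_word2vec_format(
            path, binary=binary, unicode_errors='ignore')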
If you saved your model with save(), you must load it with load().
load_word2vec_format is for models generated by Google's original C tool, not for models saved with gensim.
I'm new to NLTK. I'm getting this error, and I've searched around for encoding/decoding issues and specifically the UnicodeDecodeError, but this error seems specific to the NLTK source code.
Here's the error:
Traceback (most recent call last):
File "A:\Python\Projects\Test\main.py", line 2, in <module>
print(pos_tag(word_tokenize("John's big idea isn't all that bad.")))
File "A:\Python\Python\lib\site-packages\nltk\tag\__init__.py", line 100, in pos_tag
tagger = load(_POS_TAGGER)
File "A:\Python\Python\lib\site-packages\nltk\data.py", line 779, in load
resource_val = pickle.load(opened_resource)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)
How do I go around fixing this error?
Here's what causes the error:
from nltk import pos_tag, word_tokenize
print(pos_tag(word_tokenize("John's big idea isn't all that bad.")))
Try this (NLTK 3.0.1 with Python 2.7.x):
import io
# Open the file as UTF-8 so downstream NLTK calls receive unicode, not raw bytes
f = io.open(txtFile, 'rU', encoding='utf-8')
I had the same problem as you. I use Python 3.4 on Windows 7.
I had installed "nltk-3.0.0.win32.exe" (from here), but when I installed "nltk-3.0a4.win32.exe" (from here), my problem with nltk.pos_tag was solved. Check it out.
EDIT: If the second link doesn't work, you can look here.
Duplicate: NLTK 3 POS_TAG throws UnicodeDecodeError
Long story short: the old NLTK isn't compatible with Python 3; you have to use NLTK 3, which sounds a bit experimental at this point.
Try using the module "textclean"
pip install textclean
Python code
from textclean.textclean import textclean
from nltk import pos_tag, word_tokenize  # needed for the tagging call below

text = textclean.clean("John's big idea isn't all that bad.")
print(pos_tag(word_tokenize(text)))