UnicodeDecodeError when loading word2vec - python

Full Description
I am starting to work with word embeddings and have found a great amount of information about them. I understand, so far, that I can train my own word vectors or use previously trained ones, such as Google's or Wikipedia's, which are available for the English language and aren't useful to me, since I am working with texts in Brazilian Portuguese. Therefore, I went on a hunt for pre-trained word vectors in Portuguese and ended up finding Hirosan's List of Pretrained Word Embeddings, which led me to Kyubyong's WordVectors, from which I learned about Rami Al-Rfou's Polyglot. After downloading both, I have been trying, unsuccessfully, to simply load the word vectors.
Short Description
I can't load pre-trained word vectors; I am trying WordVectors and Polyglot.
Downloads
Kyubyong's pre-trained word2vec-format word vectors for Portuguese;
Polyglot's pre-trained word vectors for Portuguese;
Loading attempts
Kyubyong's WordVectors
First attempt: using Gensim as suggested by Hirosan;
from gensim.models import KeyedVectors
kyu_path = '.../pre-trained_word_vectors/kyubyong_pt/pt.bin'
word_vectors = KeyedVectors.load_word2vec_format(kyu_path, binary=True)
And the error returned:
[...]
File "/Users/luisflavio/anaconda3/lib/python3.6/site-packages/gensim/utils.py", line 359, in any2unicode
return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
The downloaded zip also contains other files, but all of them return similar errors.
Polyglot
First attempt: following Al-Rfou's instructions;
import pickle
import numpy
pol_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
words, embeddings = pickle.load(open(pol_path, 'rb'))
And the error returned:
File "/Users/luisflavio/Desktop/Python/w2v_loading_tries.py", line 14, in <module>
words, embeddings = pickle.load(open(pol_path, "rb"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd4 in position 1: ordinal not in range(128)
Second attempt: using Polyglot's word embedding load function;
First, we have to install polyglot via pip:
pip install polyglot
Now we can import it:
from polyglot.mapping import Embedding
pol_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
embeddings = Embedding.load(pol_path)
And the error returned:
File "/Users/luisflavio/anaconda3/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Extra Information
I am using Python 3 on macOS High Sierra.
Solutions
Kyubyong's WordVectors
As pointed out by Aneesh Joshi, the correct way to load Kyubyong's model is by calling the native load function of Word2Vec.
from gensim.models import Word2Vec
kyu_path = '.../pre-trained_word_vectors/kyubyong_pt/pt.bin'
model = Word2Vec.load(kyu_path)
Even though I am more than grateful for Aneesh Joshi's solution, Polyglot seems to be a better model for working with Portuguese. Any ideas about that one?
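Update: since the Polyglot file is a pickle created under Python 2, one thing worth trying (an assumption on my part, mirroring the latin1 trick from the Doc2Vec answer further down) is passing an explicit encoding to pickle.load:
import pickle

pol_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
# Python 2 pickles containing numpy arrays usually need latin1 here
with open(pol_path, 'rb') as f:
    words, embeddings = pickle.load(f, encoding='latin1')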

For Kyubyong's pre-trained word2vec .bin file:
It may have been saved using gensim's save function.
"load the model with load(). Not load_word2vec_format (that's for the C-tool compatibility)."
i.e., model = Word2Vec.load(fname)
Let me know if that works.
Reference: Gensim mailing list

Related

How to correctly read prophet model from JSON object stored in GCS

I have a Prophet model that I have stored in a Google Cloud Storage folder, and now I want to read this model in my code to run the prediction pipeline. The model object was stored as JSON following this link: https://facebook.github.io/prophet/docs/additional_topics.html
For this, I first download the JSON object locally from the bucket. Then I try to use the model_from_json() method. However, I keep getting the error below -
import json
from google.cloud import bigquery, storage
from prophet.serialize import model_to_json, model_from_json
storage_client = storage.Client()  # assumed client setup, not shown in the original snippet
bucket = storage_client.get_bucket(bucket_name)
blob = bucket.blob('/GCSpath/to/.json')
blob.download_to_filename('mymodel.json') # download the file locally
with open('mymodel.json', 'r') as fin: m = model_from_json(json.load(fin))
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/python/3.7.11/lib/python3.7/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/Users/python/3.7.11/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
I also tried the method specified here, but it still does not work: Downloading a file from google cloud storage inside a folder
What is the correct way to save and load Prophet models?
The error UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte indicates that either your filename or some text inside your file is not formatted in UTF-8.
This means that your file contains some special characters that cannot be decoded, for example Cyrillic characters or other byte sequences outside UTF-8. Check this here for a reference on the difference between Unicode and UTF; you will find some examples too.
I would recommend checking your files for incompatible special characters and removing them. The error also marks the position where it was found, so you could start from there.
On the other hand, if reviewing file by file and removing characters is not viable, you could also try opening your files in binary.
Instead of using 'r' in the open() command:
with open('mymodel.json', 'r') as fin: m = model_from_json(json.load(fin))
Try using 'rb':
with open('mymodel.json', 'rb') as fin: m = model_from_json(json.load(fin))
This will most likely solve your problem, since reading a file in binary mode does not try to decode bytes into strings, so no decoding errors can occur. You may find more information about file reading in Python here, and more about how or why to read files in binary here.
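For reference, the Prophet serialization docs linked in the question describe a round trip along these lines (a sketch; depending on the Prophet version, model_from_json expects either a raw JSON string or a parsed dict, so check the docs for your version):
from prophet.serialize import model_to_json, model_from_json

# m is an already-fitted Prophet model, e.g. m = Prophet().fit(df)
with open('serialized_model.json', 'w') as fout:
    fout.write(model_to_json(m))  # serialize to a JSON string

with open('serialized_model.json', 'r') as fin:
    m = model_from_json(fin.read())  # restore the model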
I think the error message is quite clear: the 'utf-8' codec cannot decode the data in your file.
When you use open(), which is a Python built-in function, it accepts an "encoding" argument that defaults to 'utf-8'.
You need to find the encoding that matches the data in your file and pass it as encoding='your-encoding-code', as sketched below.
Hope this helps!
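A quick sketch of what that looks like ('latin-1' here is just a hypothetical guess; substitute whatever encoding your file actually uses):
import json
from prophet.serialize import model_from_json

# hypothetical encoding: replace 'latin-1' with your file's real encoding
with open('mymodel.json', 'r', encoding='latin-1') as fin:
    m = model_from_json(json.load(fin))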

Error when loading .bin embedding file using gensim package

I tried to load a .bin embedding file using gensim but I got errors. I tried all the methods provided by gensim but couldn't rectify the error.
Method 1
import gensim.models.keyedvectors as word2vec
model = word2vec.KeyedVectors.load_word2vec_format('Health_2.5reviews.s200.w10.n5.v10.cbow.bin', binary=True, unicode_errors='ignore')
Method 2
from gensim.models import KeyedVectors
filename = 'Health_2.5reviews.s200.w10.n5.v10.cbow.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True, unicode_errors='ignore')
Methods 1 and 2 gave the error:
"UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 0: invalid start byte"
Method 3
from gensim.models import Word2Vec
filename='Health_2.5reviews.s200.w10.n5.v10.cbow.bin'
model=Word2Vec.load(filename)
Method 3 gave the error
UnpicklingError: invalid load key, '\xbc'.

Fasttext UnicodeDecode issue

I am trying to load a fastText file to use as word embeddings for the first time. I have this:
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(binary_file_path,
                                          binary=True, encoding='utf-8', unicode_errors='ignore')
I also tried what is described here: https://datascience.stackexchange.com/questions/20071/how-do-i-load-fasttext-pretrained-model-with-gensim
Still the same result.
I have downloaded the .bin file from kaggle (https://www.kaggle.com/kambarakun/fasttext-pretrained-word-vectors-english)
But still I am having the issue:
'utf8' codec can't decode byte 0xba in position 0: invalid start byte
I want to use only the .bin file and not .vec file as it takes less time.
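One likely cause (my reading, not confirmed in this thread): Facebook's native fastText .bin format is not the word2vec C binary format, which is why load_word2vec_format fails on the very first byte. Recent gensim versions ship a dedicated loader for the native format; a minimal sketch, assuming gensim >= 3.8:
from gensim.models.fasttext import load_facebook_vectors

# loads the native fastText .bin, including subword information
wv = load_facebook_vectors(binary_file_path)
print(wv['hello'])  # subwords make even out-of-vocabulary lookups work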

Doc2Vec model Python 3 compatibility

I trained a doc2vec model with Python 2 and I would like to use it in Python 3.
When I try to load it in Python 3, I get:
Doc2Vec.load('my_doc2vec.pkl')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 0: ordinal not in range(128)
It seems to be related to a pickle compatibility issue, which I tried to solve by doing:
with open('my_doc2vec.pkl', 'rb') as inf:
    data = pickle.load(inf)
data.save('my_doc2vec_python3.pkl')
Gensim saved other files alongside the model, which I renamed as well so they can be found when calling:
de = Doc2Vec.load('my_doc2vec_python3.pkl')
The load() no longer fails with a UnicodeDecodeError, but the subsequent inference produces meaningless results.
I can't easily re-train it using Gensim in Python 3 as I used this model to create derived data from it, so I would have to re-run a long and complex pipeline.
How can I make the doc2vec model compatible with Python 3?
Answering my own question: this answer worked for me.
Here are the steps in a bit more detail:
download the gensim source code, e.g. clone it from the repo
in gensim/utils.py, edit the method unpickle to add the encoding parameter:
return _pickle.loads(f.read(), encoding='latin1')
using Python 3 and the modified gensim, load the model:
de = Doc2Vec.load('my_doc2vec.pkl')
save it:
de.save('my_doc2vec_python3.pkl')
This model should now be loadable in Python 3 with unmodified gensim.
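If editing gensim's source is inconvenient, the same effect can usually be achieved by monkey-patching gensim.utils.unpickle before loading (a sketch, assuming the model file is a plain pickle on local disk):
import pickle
import gensim.utils
from gensim.models.doc2vec import Doc2Vec

def unpickle_latin1(fname):
    # decode Python 2 byte strings as latin1 instead of ascii
    with open(fname, 'rb') as f:
        return pickle.loads(f.read(), encoding='latin1')

gensim.utils.unpickle = unpickle_latin1  # patch before loading
de = Doc2Vec.load('my_doc2vec.pkl')
de.save('my_doc2vec_python3.pkl')  # re-save as a Python 3 pickle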

Error: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I am trying to do the following Kaggle assignment. I am using the gensim package to use word2vec. I am able to create the model and store it to disk. But when I try to load the file back, I get the error below.
-HP-dx2280-MT-GR541AV:~$ python prog_w2v.py
Traceback (most recent call last):
File "prog_w2v.py", line 7, in <module>
models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 579, in load_word2vec_format
header = utils.to_unicode(fin.readline())
File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 190, in any2unicode
return unicode(text, encoding, errors=errors)
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
I found a similar question, but I was unable to solve the problem. My prog_w2v.py is as below.
import gensim
import time
start = time.time()
models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
end = time.time()
print end-start," seconds"
I am trying to generate the model using the code here. The program takes about half an hour to generate the model, so I am unable to run it many times to debug it.
You are not loading the file correctly. You should use load() instead of load_word2vec_format().
The latter is used when you train a model with the original C tool and save it in a binary format. However, you did not save the model in that format; you trained it using Python. So you can simply use the following code and it should work:
models = gensim.models.Word2Vec.load('300features_40minwords_10context.txt')
If you save your model with:
model.wv.save(OUTPUT_FILE_PATH + 'word2vec.bin')
Then loading word2vec with load_word2vec_format method would cause the issue. To make it work, you should use:
wiki_model = KeyedVectors.load(OUTPUT_FILE_PATH + 'word2vec.bin')
The same thing also happens when you save the model with:
model.wv.save_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)
and then try to load it with the KeyedVectors.load method. In this case, use:
wiki_model = KeyedVectors.load_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)
As per the other answers, knowing how you saved the file is important, because each format has a matching load method. But you can simply use the flag unicode_errors='ignore' to skip this issue and load the model as you want.
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format(file_path, binary=True, unicode_errors='ignore')
By default, this flag is set to 'strict': unicode_errors='strict'.
According to the documentation, the following is given as the reason why errors like this occur:
unicode_errors : str, optional
default 'strict', is a string suitable to be passed as the errors
argument to the unicode() (Python 2.x) or str() (Python 3.x) function. If your source
file may include word tokens truncated in the middle of a multibyte unicode character
(as is common from the original word2vec.c tool), 'ignore' or 'replace' may help.
All of the above answers are helpful if we can really keep track of how each model was saved. But what if we have a bunch of models to load and want a general method for it? We can use the above flag to do so.
I have myself experienced instances where I train multiple models using the original word2vec.c tool, but when I try to load them into gensim, some models load successfully and some give these unicode errors. I have found the above flag to be helpful and convenient.
If you saved your model with save(), you must use load().
load_word2vec_format() is for models in the original word2vec C tool format (such as Google's pre-trained vectors), not for models saved with gensim's native save().
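To summarize the pairing (a sketch with hypothetical filenames):
from gensim.models import Word2Vec, KeyedVectors

# native gensim save()     -> native load()
model = Word2Vec.load('model.gensim')

# word2vec C binary format -> load_word2vec_format(binary=True)
kv_bin = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

# word2vec C text format   -> load_word2vec_format(binary=False)
kv_txt = KeyedVectors.load_word2vec_format('vectors.txt', binary=False)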
