I am trying to load a fastText .bin file to use as word embeddings for the first time. I have this:
from gensim.models import KeyedVectors

KeyedVectors.load_word2vec_format(binary_file_path,
                                  binary=True, encoding='utf-8', unicode_errors='ignore')
I also tried what is described here: https://datascience.stackexchange.com/questions/20071/how-do-i-load-fasttext-pretrained-model-with-gensim
Still the same result.
I have downloaded the .bin file from kaggle (https://www.kaggle.com/kambarakun/fasttext-pretrained-word-vectors-english)
But I am still getting this error:
'utf8' codec can't decode byte 0xba in position 0: invalid start byte
I want to use only the .bin file, not the .vec file, since it loads faster.
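Note that gensim ships a dedicated loader for Facebook's native fastText .bin format, which is not the same as the word2vec C format that load_word2vec_format expects. A minimal sketch, assuming gensim 3.8 or later (the file path is only a placeholder for the Kaggle download):

from gensim.models.fasttext import load_facebook_vectors

binary_file_path = 'wiki.en.bin'  # hypothetical path to the downloaded .bin
word_vectors = load_facebook_vectors(binary_file_path)
print(word_vectors['hello'][:5])  # first five dimensions of one vector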
Related
I have a Prophet model stored in a Google Cloud Storage folder, and I now want to read this model in my code to run a prediction pipeline. The model object was stored as JSON following https://facebook.github.io/prophet/docs/additional_topics.html
For this, I first download the JSON object locally from the bucket, and then I try to use the model_from_json() method. However, I keep getting the error below:
import json
from google.cloud import bigquery, storage
from prophet.serialize import model_to_json, model_from_json

storage_client = storage.Client()  # client initialization, omitted from the original snippet
bucket = storage_client.get_bucket(bucket_name)
blob = bucket.blob('/GCSpath/to/.json')
blob.download_to_filename('mymodel.json')  # download the file locally
with open('mymodel.json', 'r') as fin:
    m = model_from_json(json.load(fin))
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/python/3.7.11/lib/python3.7/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/Users/python/3.7.11/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
I also tried the method described in "Downloading a file from google cloud storage inside a folder", but it still does not work.
What is the correct way to save and load Prophet models?
The error UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte indicates that either your filename or some text inside your file is not encoded in UTF-8.
This means that your file contains bytes that cannot be decoded as UTF-8, for example Cyrillic characters or other characters from outside the expected encoding. Check this reference on the difference between Unicode and UTF-8, where you will find some examples too.
I would recommend checking your files for incompatible special characters and removing them. The error message also gives the position where the offending byte was found, so you could start looking there.
On the other hand, if reviewing the files one by one and removing characters is not viable, you could also try opening your files in binary mode.
Instead of using 'r' in the open() command:
with open('mymodel.json', 'r') as fin: m = model_from_json(json.load(fin))
Try using 'rb':
with open('mymodel.json', 'rb') as fin: m = model_from_json(json.load(fin))
This will most likely solve your problem, since reading a file in binary mode does not try to decode bytes into strings, so no decoding issues arise. You may find more information about file reading in Python here, and more about how and why to read files in binary here.
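As a quick diagnostic (a sketch; it assumes the local filename from the question), you can peek at the first bytes in binary mode to see what was actually downloaded. A JSON document should begin with b'{', so a leading 0x80 byte suggests the blob is not the plain JSON text you expected.

# Peek at the start of the downloaded file without decoding it
with open('mymodel.json', 'rb') as fin:
    head = fin.read(16)
print(head)  # JSON should start with b'{' (byte 0x7b)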
I think the error message is quite clear: 'utf-8' cannot decode the data in your file.
When you use open(), which is a Python built-in function, it accepts an "encoding" argument that defaults to your platform's preferred encoding (often UTF-8).
You need to find the encoding that matches the data in your file and pass it as encoding="your-encoding-name".
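For example, a minimal sketch ('latin-1' is only a guess at the file's real encoding, not a known fact about your data):

# 'latin-1' can decode any byte, so it is a useful fallback while you
# figure out the real encoding of the file
with open('mymodel.json', 'r', encoding='latin-1') as fin:
    text = fin.read()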
Hope this helps!
I am trying to read a CSV file into my Jupyter notebook using Python 3:
import pandas as pd

address = 'C:/Users/X/Y/Z/Data.csv'
data = pd.read_csv(address)
This is the message I am getting:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 0: invalid continuation byte
Any suggestions? I am having trouble understanding what it wants me to do with the data in order to load it.
Thanks a lot !
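A common workaround is to pass an explicit encoding to read_csv. A sketch; cp1252 is only an assumption, chosen because files exported by Windows tools are often cp1252 rather than UTF-8:

import pandas as pd

address = 'C:/Users/X/Y/Z/Data.csv'
data = pd.read_csv(address, encoding='cp1252')  # the encoding is a guess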
Full Description
I am starting to work with word embeddings and have found a great amount of information about them. I understand, so far, that I can train my own word vectors or use previously trained ones, such as Google's or Wikipedia's; however, those are available for the English language and aren't useful to me, since I work with texts in Brazilian Portuguese. Therefore, I went hunting for pre-trained word vectors in Portuguese and ended up finding Hirosan's List of Pretrained Word Embeddings, which led me to Kyubyong's WordVectors, from which I learned about Rami Al-Rfou's Polyglot. After downloading both, I have been trying, unsuccessfully, to simply load the word vectors.
Short Description
I can't load pre-trained word vectors; I am trying WordVectors and Polyglot.
Downloads
Kyubyong's pre-trained word2vec-format word vectors for Portuguese;
Polyglot's pre-trained word vectors for Portuguese;
Loading attempts
Kyubyong's WordVectors
First attempt: using Gensim, as suggested by Hirosan:
from gensim.models import KeyedVectors
kyu_path = '.../pre-trained_word_vectors/kyubyong_pt/pt.bin'
word_vectors = KeyedVectors.load_word2vec_format(kyu_path, binary=True)
And the error returned:
[...]
File "/Users/luisflavio/anaconda3/lib/python3.6/site-packages/gensim/utils.py", line 359, in any2unicode
return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
The zip downloaded also contains other files but all of them return similar errors.
Polyglot
First attempt: following Al-Rfou's instructions:
import pickle
import numpy
pol_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
words, embeddings = pickle.load(open(pol_path, 'rb'))
And the error returned:
File "/Users/luisflavio/Desktop/Python/w2v_loading_tries.py", line 14, in <module>
words, embeddings = pickle.load(open(polyglot_path, "rb"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd4 in position 1: ordinal not in range(128)
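This 'ascii' codec error is the classic symptom of loading a pickle written by Python 2 under Python 3. A sketch of a workaround (the encoding argument is my assumption, not part of Al-Rfou's instructions): pickle.load accepts an encoding parameter, and 'latin-1' is the usual choice because it can decode any byte:

import pickle

pol_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
with open(pol_path, 'rb') as f:
    # encoding applies to the str objects inside the pickle, not the file itself
    words, embeddings = pickle.load(f, encoding='latin-1')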
Second attempt: using Polyglot's word embedding load function:
First, we have to install polyglot via pip:
pip install polyglot
Now we can import it:
from polyglot.mapping import Embedding
pol_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
embeddings = Embedding.load(pol_path)
And the error returned:
File "/Users/luisflavio/anaconda3/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Extra Information
I am using Python 3 on macOS High Sierra.
Solutions
Kyubyong's WordVectors
As pointed out by Aneesh Joshi, the correct way to load Kyubyong's model is by calling the native load function of Word2Vec.
from gensim.models import Word2Vec
kyu_path = '.../pre-trained_word_vectors/kyubyong_pt/pt.bin'
model = Word2Vec.load(kyu_path)
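Once loaded, the vectors can be queried through the model's wv attribute. A quick usage sketch (the word is chosen only for illustration and must exist in the model's vocabulary):

print(model.wv['computador'][:5])                   # first dimensions of one word vector
print(model.wv.most_similar('computador', topn=3))  # three nearest neighbours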
Even though I am more than grateful for Aneesh Joshi's solution, Polyglot seems to be a better fit for working with Portuguese. Any ideas about that one?
For Kyubyong's pre-trained word2vec .bin file: it may have been saved using gensim's native save function.
"load the model with load(). Not load_word2vec_format (that's for the C-tool compatibility)."
i.e., model = Word2Vec.load(fname)
Let me know if that works.
Reference: Gensim mailing list
I have generated a binary file with word2vec, stored the resulting .bin file in my GCS bucket, and ran the following code in my App Engine app handler:
gcs_file = gcs.open(filename, 'r')
content = gcs_file.read().encode("utf-8")
# call word2vec with content so it doesn't need to read a file itself,
# as we don't have a filesystem in GAE
Fails with this error:
content = gcs_file.read().encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 15: ordinal not in range(128)
A similar decode error happens if I try gcs_file.read(), or gcs_file.read().decode("utf-8").encode("utf-8").
Any ideas on how to read a binary file from GCS?
Thanks
If it is binary, then it does not have a character encoding, which is what UTF-8 is. UTF-8 is just one possible binary encoding of the Unicode character-set specification (string data). You need to go back and read up on what UTF-8 and ASCII represent and how they are used.
If the data is not text that was encoded with a specific encoding, it is not going to magically just decode, which is why you are getting that error: can't decode byte 0xf6 in position 15 means 0xf6 is not a valid ASCII value.
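In other words, read the bytes and keep them as bytes, with no encode()/decode() round-trip at all. A sketch, assuming the cloudstorage client from the question:

gcs_file = gcs.open(filename, 'r')
content = gcs_file.read()  # raw bytes; do not call .encode() or .decode()
gcs_file.close()
# hand the raw bytes straight to word2vec instead of a file path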
I am trying to build word vectors using the following code segment:

import os
from sklearn.feature_extraction.text import CountVectorizer  # assuming a scikit-learn vectorizer
DIR = r"C:\Users\Desktop\data\rec.sport.hockey"  # raw string, so backslashes aren't escape codes
vectorizer = CountVectorizer()
posts = [open(os.path.join(DIR, f)).read() for f in os.listdir(DIR)]
x_train = vectorizer.fit_transform(posts)
However, the code returns the following error message:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 240: invalid start byte
I think it is related to some non-ASCII characters. How can I solve this issue?
Is the file automatically generated? If not, one simple solution is to open the file in Notepad++ and convert its encoding to UTF-8.
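If the files are generated automatically and converting them by hand is impractical, a programmatic alternative is to pick the decoding behaviour in code. A sketch; 'latin-1' is an assumption about the corpus, chosen because it can decode any byte:

import os

DIR = r"C:\Users\Desktop\data\rec.sport.hockey"
# every byte maps to some character in latin-1, so decoding cannot fail
posts = [open(os.path.join(DIR, f), encoding='latin-1').read()
         for f in os.listdir(DIR)]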