UnicodeDecodeError when I read a large file using pd.read_csv - python

train = pd.read_csv('./train_vec.csv', header=None, sep=',', names=['label', 'vec', 'vec_with_sims'])
I got the error below:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaf in position
3: invalid start byte
At first I thought it was a problem with the encoding format, but when I tried to read only part of the dataset (for example, only the first 10,000 rows),
train = pd.read_csv('./train_vec.csv', nrows=10000, header=None, sep=',', names=['label', 'vec', 'vec_with_sims'])
the error disappeared!
Is it because the training set is too big (2.4 GB)? My system configuration:
Ubuntu 16.04, GTX 1070, 16 GB memory
I think that should be enough!
What's even stranger is that every time the computer restarts, the training set loads normally, but after a while, trying to load it again produces the same error.

Please try adding encoding='unicode_escape'.
For example:
train = pd.read_csv(r'./train_vec.csv', header=None, sep=',', names=['label', 'vec', 'vec_with_sims'], encoding='unicode_escape')
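If unicode_escape does not fix the garbled text, another option (a minimal sketch, assuming the chardet package is installed; the sample size is arbitrary) is to sniff the file's actual encoding from a chunk of raw bytes and pass that to read_csv:
import chardet
import pandas as pd

# Guess the encoding from a raw sample of the file.
with open('./train_vec.csv', 'rb') as f:
    sample = f.read(1000000)
guess = chardet.detect(sample)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}

train = pd.read_csv('./train_vec.csv', header=None, sep=',',
                    names=['label', 'vec', 'vec_with_sims'],
                    encoding=guess['encoding'])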

Related

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 2: invalid continuation byte

I'm a newbie who is interested in machine learning using Python. So I downloaded a dataset from https://data.world/nrippner/ols-regression-challenge and tried to read it using
dataset = pd.read_csv('cancer_reg.csv').
But that error message came up. What should I do?
Usually these types of issues arise because of the encoding. You can try using these two parameters in combination, and it should probably work. I'm using latin1 because of the 0xf1 in your error.
dataset = pd.read_csv('cancer_reg.csv', engine='python', encoding='latin1')
Try the following:
dataset = pd.read_csv('cancer_reg.csv', encoding="ISO-8859-1")
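For context, latin1/ISO-8859-1 sidesteps the error because it maps every byte value 0x00-0xFF to some character, so decoding can never fail; accented text may still come out wrong if the file's true encoding is something else. A small illustration (not from the original answer):
raw = b'caf\xe9'              # 'café' encoded as latin-1
print(raw.decode('latin-1'))  # 'café' -- every byte decodes to something
# raw.decode('utf-8')         # would raise UnicodeDecodeError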

Fasttext UnicodeDecode issue

I am trying to load a fastText file to use it as word embeddings for the first time. I have this:
KeyedVectors.load_word2vec_format(binary_file_path,
                                  binary=True, encoding='utf-8', unicode_errors='ignore')
I also tried what is described here: https://datascience.stackexchange.com/questions/20071/how-do-i-load-fasttext-pretrained-model-with-gensim
Still the same result.
I have downloaded the .bin file from kaggle (https://www.kaggle.com/kambarakun/fasttext-pretrained-word-vectors-english)
But I am still having the issue:
'utf8' codec can't decode byte 0xba in position 0: invalid start byte
I want to use only the .bin file and not the .vec file, as it takes less time to load.
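One hedged possibility (a sketch, assuming a recent gensim version; the path is illustrative): Facebook's pretrained .bin files are in fastText's own binary format rather than the word2vec format, so load_word2vec_format cannot parse them. gensim ships a fastText-specific loader for that case:
from gensim.models.fasttext import load_facebook_vectors

# Illustrative path to a Facebook-format .bin file (not from the original post).
wv = load_facebook_vectors('./wiki.en.bin')
print(wv['hello'][:5])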

Encoding discrepancy with Iris Dataset

After I downloaded the dataset as iris.data, I renamed it to iris.data.txt. I was trying to circumvent this reported error on SO:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 8: invalid continuation byte
After reading up, I tried this:
dataset = pd.read_csv('iris.data.txt', header=None, names=names, encoding="ISO-8859-1")
This partly solved the error, but some rows were still garbage.
Then I tried to open it with Sublime, save it with utf-8 encoding, and then dataset = pd.read_csv('iris.data.txt', header=None, names=names, encoding="utf-8")
But this doesn't solve the problem either. I'm running Python 3 on Mac OS. What could possibly render the data readable directly?
[EDIT]:
The file type reads: Web archive. In Spyder, the file appears as iris.data.webarchive.
If I try dataset = pd.read_csv('iris.data.webarchive', header=None), it gives this traceback:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 5
If I try dataset = pd.read_csv('iris.data', header=None), it gives FileNotFoundError: File b'iris.data' does not exist
I figured out my rookie mistake. I had to save the page as 'source' instead of 'webarchive' (which is the default Mac setting).
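With the file saved as plain source, reading it should be straightforward. A minimal sketch, assuming the standard UCI iris.data layout (the column names here are illustrative, since names was not defined in the post):
import pandas as pd

names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
dataset = pd.read_csv('iris.data', header=None, names=names)
print(dataset.head())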

Error: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I am trying to do the following Kaggle assignment. I am using the gensim package for word2vec. I am able to create the model and store it to disk. But when I try to load the file back, I get the error below.
-HP-dx2280-MT-GR541AV:~$ python prog_w2v.py
Traceback (most recent call last):
File "prog_w2v.py", line 7, in <module>
models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 579, in load_word2vec_format
header = utils.to_unicode(fin.readline())
File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 190, in any2unicode
return unicode(text, encoding, errors=errors)
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
I found a similar question, but I was unable to solve the problem. My prog_w2v.py is below.
import gensim
import time
start = time.time()
models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
end = time.time()
print end-start," seconds"
I am trying to generate the model using code here. The program takes about half an hour to generate the model. Hence I am unable to run it many times to debug it.
You are not loading the file correctly. You should use load() instead of load_word2vec_format().
The latter is used when you train a model using the C code and save the model in a binary format. However, you are not saving the model in a binary format, and you are training it using Python. So you can simply use the following code and it should work:
models = gensim.models.Word2Vec.load('300features_40minwords_10context.txt')
If you save your model with:
model.wv.save(OUTPUT_FILE_PATH + 'word2vec.bin')
Then loading it with the load_word2vec_format method will cause this issue. To make it work, you should use:
wiki_model = KeyedVectors.load(OUTPUT_FILE_PATH + 'word2vec.bin')
The same thing also happens when you save the model with:
model.wv.save_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)
And then try to load it with the KeyedVectors.load method. In this case, use:
wiki_model = KeyedVectors.load_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)
As per the other answers, knowing how the file was saved is important because there are specific ways to load it. But you can simply use the flag unicode_errors='ignore' to skip this issue and load the model as you want.
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format(file_path, binary=True, unicode_errors='ignore')
By default, this flag is set to 'strict': unicode_errors='strict'.
According to the documentation, the following is given as the reason as to why errors like this occur.
unicode_errors : str, optional
default 'strict', is a string suitable to be passed as the errors
argument to the unicode() (Python 2.x) or str() (Python 3.x) function. If your source
file may include word tokens truncated in the middle of a multibyte unicode character
(as is common from the original word2vec.c tool), 'ignore' or 'replace' may help.
All of the above answers are helpful if we can really keep track of how each model was saved. But what if we have a bunch of models that we need to load and want a general method for it? We can use the above flag to do so.
I have experienced instances where I train multiple models using the original word2vec.c tool, but when I try to load them into gensim, some models load successfully and some give the unicode errors. I have found the above flag to be helpful and convenient.
If you saved your model with save(), you must use load().
load_word2vec_format() is for models in the original word2vec (Google) format, not for models saved by gensim's save().
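A hedged recap of the matched save/load pairs discussed above (the sentences, file names, and parameters are illustrative only):
from gensim.models import Word2Vec, KeyedVectors

sentences = [['hello', 'world'], ['goodbye', 'world']]
model = Word2Vec(sentences, min_count=1)

# Pair 1: gensim's native format -- save() goes with load()
model.save('w2v.model')
model = Word2Vec.load('w2v.model')

# Pair 2: the word2vec C/Google format -- save_word2vec_format()
# goes with load_word2vec_format(), with matching binary= flags
model.wv.save_word2vec_format('w2v.bin', binary=True)
wv = KeyedVectors.load_word2vec_format('w2v.bin', binary=True)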

Python Encoding/Decoding for writing to a text file

I've honestly spent a lot of time on this, and it's slowly killing me. I've stripped content from a PDF and stored it in an array. Now I'm trying to pull it back out of the array and write it into a txt file. However, I do not seem to be able to make it happen because of encoding issues.
allTheNTMs.append(contentRaw[s1:].encode("utf-8"))
for a in range(len(allTheNTMs)):
    kmlDescription = allTheNTMs[a]
    print kmlDescription  # this prints out fine
    outputFile.write(kmlDescription)
The error I'm getting is "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 213: ordinal not in range(128)".
I'm just messing around now, but I've tried all kinds of ways to get this stuff to write out.
outputFile.write(kmlDescription).decode('utf-8')
Please forgive me if this is basic, I'm still learning Python (2.7).
Cheers!
EDIT1: Sample data looks something like the following:
Chart 3686 (plan, Morehead City) [ previous update 4997/11 ] NAD83 DATUM
Insert the accompanying block, showing amendments to coastline,
depths and dolphins, centred on: 34° 41´·19N., 76° 40´·43W.
Delete R 34° 43´·16N., 76° 41´·64W.
When I add the print type(raw), I get
Edit 2: When I just try to write the data, I receive the original error message (ascii codec can't decode byte...)
I will check out the suggested thread and video. Thanks folks!
Edit 3: I'm using Python 2.7
Edit 4: agf hit the nail on the head in the comments below when (s)he noticed that I was double encoding. I tried intentionally double encoding a string that had previously been working, and it produced the same error message that was originally thrown. Something like:
text = "Here's a string, but imagine it has some weird symbols and whatnot in it - apparently latin-1"
textEncoded = text.encode('utf-8')
textEncodedX2 = textEncoded.encode('utf-8')
outputfile.write(textEncoded) #Works!
outputfile.write(textEncodedX2) #failed
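(As an aside, a hedged illustration of why the second write fails under Python 2; this snippet is Python 2 only and not from the original post. Calling .encode() on a str that already holds UTF-8 bytes makes Python first decode it implicitly with the ascii codec, and that implicit decode is what raises the error.)
u = u'34\xb0 41\u00b4'  # degree/prime characters like the chart data
s = u.encode('utf-8')   # unicode -> bytes: fine
s.encode('utf-8')       # implicitly does s.decode('ascii') first, which
                        # fails on the 0xc2 byte -> UnicodeDecodeError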
Once I figured out I was trying to double encode, the solution was the following:
allTheNTMs.append(contentRaw[s1:].encode("utf-8"))
for a in range(len(allTheNTMs)):
    kmlDescription = allTheNTMs[a]
    kmlDescriptionDecode = kmlDescription.decode("latin-1")
    outputFile.write(kmlDescriptionDecode)
It's working now, and I sure appreciate all of your help!!
My guess is that the output file you have opened was opened with a latin1 or even utf-8 codec, hence you cannot write utf-8-encoded data to it, because it tries to re-encode it. To a normally opened file you can write any arbitrary byte string. Here is an example recreating a similar error:
import codecs

u = u'सच्चिदानन्द हीरानन्द वात्स्यायन '
s = u.encode('utf-8')
f = codecs.open('del.text', 'wb', encoding='latin1')
f.write(s)
output:
Traceback (most recent call last):
File "/usr/lib/wingide4.1/src/debug/tserver/_sandbox.py", line 1, in <module>
# Used internally for debug sandbox under external interpreter
File "/usr/lib/python2.7/codecs.py", line 691, in write
return self.writer.write(data)
File "/usr/lib/python2.7/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
Solution:
This will work if you don't set any codec:
f = open('del.txt', 'wb')
f.write(s)
The other option is to write the unicode string directly to the file without encoding it, if outputFile has been opened with the correct codec, e.g.:
f = codecs.open('del.text', 'wb',encoding='utf-8')
f.write(u)
Your error message doesn't appear to relate to your Python syntax, but rather to the fact that you're trying to decode a byte value that has no equivalent in UTF-8.
Hex 0xc2 represents a Latin-1 character (an uppercase A with a circumflex accent). Therefore, instead of using allTheNTMs.append(contentRaw[s1:].encode("utf-8")), try:
allTheNTMs.append(contentRaw[s1:].encode("latin-1"))
I'm not an expert in Python, so this may not work, but it would appear you're trying to encode a Latin-1 character. Given the error message you are receiving, Python appears to be falling back to the ASCII codec, which only covers the first 128 code points; 0xc2 is outside that range, which is why the decode fails.
