After I downloaded the dataset as iris.data, I renamed it to iris.data.txt. I was trying to circumvent this reported error on SO:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 8: invalid continuation byte
After reading up, I tried this:
dataset = pd.read_csv('iris.data.txt', header=None, names=names, encoding="ISO-8859-1")
This partly solved the error but some rows were still garbage.
Then I tried opening it in Sublime, saving it with UTF-8 encoding, and then running dataset = pd.read_csv('iris.data.txt', header=None, names=names, encoding="utf-8")
But this doesn't solve the problem either. I'm running Python 3 on macOS. What would make the data readable directly?
[EDIT]:
The file type reads: Web archive. In Spyder, the file appears as iris.data.webarchive
If I try dataset = pd.read_csv('iris.data.webarchive', header=None), it gives this traceback:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 5
If I try dataset = pd.read_csv('iris.data', header=None), it gives FileNotFoundError: File b'iris.data' does not exist
I figured out my rookie mistake: I had to save the page as 'source' instead of 'webarchive' (which is the default Mac setting).
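For completeness, once the page is saved as plain source, the original call should work with the default encoding. A minimal sketch (the question's names list isn't shown, so the standard iris column names are assumed here as a stand-in):
import pandas as pd
# Assumed column names for the iris dataset (placeholder for the question's `names` list)
names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
# Once the file is no longer a .webarchive container, the default utf-8 encoding is fine
dataset = pd.read_csv('iris.data.txt', header=None, names=names)
print(dataset.head())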
Related
I have a CSV dataset for an ML classifier. It has 2 columns and looks like this:
But this dataset is very dirty, so I decided to open it in Excel, remove the "dirty" words, save it as a new CSV file, and train my ML classifier on it.
But after saving it in Excel (using the comma separator, and I also tried the comma-separated UTF-8 option), pd.read_csv gives me this error:
Error tokenizing data. C error: Expected 3 fields in line 4, saw 5
Then I tried to use sep=';' with read_csv, and it worked, but now all Russian characters are replaced with strange symbols:
Can somebody please explain how to recover the Russian characters from these "question mark" symbols? encoding='UTF-8' gives this error:
'utf-8' codec can't decode byte 0xe6 in position 22: invalid continuation byte
This is what the first file (the unmodified Excel .csv file) looks like:
And this is what the second (modified) file looks like when I open it:
Try opening the file with either ptcp154 or kz1048 encodings. They seem to work.
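For example, a rough sketch assuming the semicolon separator that worked earlier and a placeholder file name; both ptcp154 and kz1048 are Cyrillic-capable codecs shipped with Python:
import pandas as pd
# Try one encoding, and fall back to the other if the text still looks wrong
df = pd.read_csv('dataset.csv', sep=';', encoding='ptcp154')
# df = pd.read_csv('dataset.csv', sep=';', encoding='kz1048')
print(df.head())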
I'm a newbie who is interested in machine learning with Python, so I downloaded a dataset from https://data.world/nrippner/ols-regression-challenge and tried to read it using
dataset = pd.read_csv('cancer_reg.csv').
But that error message came up. What should I do?
Usually this type of issue arises because of the encoding. You can try using these two parameters in combination, and it should probably work. I'm using latin1 because of the 0x1f you report in your error.
dataset = pd.read_csv('cancer_reg.csv', engine='python', encoding='latin1')
Try the following
dataset = pd.read_csv('cancer_reg.csv', encoding="ISO-8859-1")
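If neither one-liner works on its own, one way to experiment is to loop over a few candidate encodings and keep the first that decodes; a rough sketch (the candidate list is an assumption, not from the original answers):
import pandas as pd
candidates = ['utf-8', 'cp1252', 'latin1']
dataset = None
for enc in candidates:
    try:
        dataset = pd.read_csv('cancer_reg.csv', encoding=enc)
        print(f'parsed with encoding={enc}')
        break
    except UnicodeDecodeError:
        continue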
train = pd.read_csv('./train_vec.csv', header=None, sep=',', names=['label', 'vec', 'vec_with_sims'])
I got the error below:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaf in position 3: invalid start byte
At first I thought it was a problem with the encoding format, but when I tried to read only part of the dataset (for example, only the first 10,000 rows),
train = pd.read_csv('./train_vec.csv', nrows=10000, header=None, sep=',', names=['label', 'vec', 'vec_with_sims'])
the error disappeared!
Is it because the training set is too big (2.4 GB)? My system configuration:
Ubuntu 16.04, GTX 1070, 16 GB memory
I think that should be enough!
What's even stranger is that every time the computer restarts, the training set loads normally, but after a while, trying to load it again gives the same error.
Please try to add encoding='unicode_escape'
For example:
train = pd.read_csv(r'./train_vec.csv', header=None, sep=',', names=['label', 'vec', 'vec_with_sims'], encoding='unicode_escape')
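Another way to investigate, rather than only suppressing the error, is to scan the raw file for the first line that isn't valid UTF-8; if that line sits beyond row 10,000, it would also explain why nrows=10000 succeeds while the full read fails. A rough sketch, assuming the same file path as in the question:
# Read the raw bytes line by line and report the first line that fails UTF-8 decoding
with open('./train_vec.csv', 'rb') as f:
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError as exc:
            print(f'line {lineno}: {exc}')
            break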
I'm running this:
news_train = load_mlcomp('20news-18828', 'train')
vectorizer = TfidfVectorizer(encoding='latin1')
X_train = vectorizer.fit_transform((open(f, errors='ignore').read()
                                    for f in news_train.filenames))
but I got UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 39: invalid continuation byte at the open() call.
I checked the news_train.filenames. It is:
array(['/Users/juby/Downloads/mlcomp/379/train/sci.med/12836-58920',
..., '/Users/juby/Downloads/mlcomp/379/train/sci.space/14129-61228'],
dtype='<U74')
The paths look correct. It may be about the dtype or my environment (I'm on Mac OS X 10.11), but I couldn't fix it after many tries. Thank you!!!
P.S. It's an ML tutorial from http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html#example-text-mlcomp-sparse-document-classification-py
Well, I found the solution: using
open(f, encoding="latin1")
I'm not sure why it only happens on my Mac, though. I'd like to know why.
Actually, in Python 3+, the open function opens files in text mode ('r') by default, which decodes the file content (on most platforms, as UTF-8). Since your files are encoded in latin1, decoding them as UTF-8 can cause a UnicodeDecodeError. The solution is either to open the files in binary mode ('rb') or to specify the correct encoding (encoding="latin1").
open(f, 'rb').read() # returns `bytes` rather than `str`
# or,
open(f, encoding='latin1').read() # returns a latin1-decoded `str`
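Applied to the snippet from the question (same vectorizer and news_train as above), the fix might look like this:
X_train = vectorizer.fit_transform(
    open(f, encoding='latin1').read() for f in news_train.filenames
)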
I am using the code below to read a CSV file into a dataframe. However, I get the error pandas.parser.CParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2, so I changed pd.read_csv('D:/TRYOUT.csv') to pd.read_csv('D:/TRYOUT.csv', error_bad_lines=False) as suggested here. However, I now get the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 1: invalid continuation byte on the same line.
def ExcelFileReader():
    mergedf = pd.read_csv('D:/TRYOUT.csv', error_bad_lines=False)
    return mergedf
If you're on Windows, you probably need to use pd.read_csv(filename, encoding='latin-1')
I had a similar problem and had to use utf-8-sig as the encoding.
The reason I used utf-8-sig is that a Latin-only encoding won't deal with non-Latin characters correctly if you ever get any. There are a few ways of getting around the problem, but I guess you can just choose whatever best suits your needs.
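For example, a rough sketch reusing the file path and flag from the question:
mergedf = pd.read_csv('D:/TRYOUT.csv', encoding='utf-8-sig', error_bad_lines=False)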
Hope that helps.
If you would like to exclude the rows causing the error and ignore the malformed data, you can use:
pd.read_csv(file_path, encoding="utf8", error_bad_lines=False, encoding_errors="ignore")
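On newer pandas releases (1.3 and later), error_bad_lines is deprecated (and removed in 2.0) in favour of on_bad_lines, so the equivalent call would be roughly:
# error_bad_lines=False becomes on_bad_lines="skip" on pandas >= 1.3
pd.read_csv(file_path, encoding="utf8", encoding_errors="ignore", on_bad_lines="skip")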