Pandas read_csv - Error tokenizing data after modifying Excel .csv file - python

I have a CSV dataset for an ML classifier. It has 2 columns and looks like this:
But this dataset is very dirty, so I decided to open it with Excel, remove "dirty" words, and save it as a new CSV file and train my ML classifier on it.
But after I saved it in Excel (with a comma separator; I also tried the comma-separated UTF-8 option), pd.read_csv on the new file gives me this error:
Error tokenizing data. C error: Expected 3 fields in line 4, saw 5
Then I tried to use sep=';' with read_csv, and it worked, but now all Russian characters are replaced with strange symbols:
Can somebody please explain how to recover the Russian characters from these "question mark" symbols? Passing encoding='UTF-8' gives this error:
'utf-8' codec can't decode byte 0xe6 in position 22: invalid continuation byte
This is what the first file looks like (not modified Excel .csv file):
When I open second file (modified):

Try opening the file with either ptcp154 or kz1048 encodings. They seem to work.
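A minimal sketch of that suggestion, combined with the sep=';' that already worked for the asker. The real file is not shown, so the sample bytes and the column names below are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the Excel-exported file:
# semicolon-separated, Cyrillic text stored in a single-byte code page.
sample_text = "text;label\nпривет мир;1\nжизнь хороша;0\n"
raw_bytes = sample_text.encode("ptcp154")

# The same bytes are not valid UTF-8, which is the kind of error the asker saw.
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Pass both the separator Excel used and the suggested encoding.
df = pd.read_csv(io.BytesIO(raw_bytes), sep=";", encoding="ptcp154")
print(df)
```

With a real file the call would look like pd.read_csv(path, sep=';', encoding='ptcp154'); kz1048 can be tried the same way.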

Related

How do I remove quotation marks around the fields of a CSV?

I'm working on taking CSV files and putting them into a PostgreSQL database. In one of the files, though, every field is surrounded by quotes. (In Excel it looks normal. In Notepad, though, one row looks like "Firstname","Lastname","CellNumber","HomeNumber",etc. when it should look like Firstname,Lastname,CellNumber,HomeNumber.) It breaks when I try to load it into SQL.
I tried loading the file into Python to do data cleaning, but I'm getting an error.
This is the code I'm running to load in the file in python:
import pandas as pd
logics = pd.read_csv("test.csv")
and this is the error I'm getting:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 28682: invalid continuation byte
I tried encoding it into utf-8, but that gave me a different error.
code:
import pandas as pd
logics = pd.read_csv("test.csv", encoding= 'utf-8')
error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 12 fields in line 53, saw 14
For whatever reason, when I manually save the file in file explorer as UTF-8 and then save it back again as a CSV file it removes the quotation marks, but I need to automate this process. Is there any way I can use python to remove these quotation marks? Is it just some different kind of encoding?
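The snippet below goes through your CSV and removes every " character from each line. You can build on it, for example by pulling the per-line logic out into a function called "clean_line". There is no real need for the pandas overhead here; the standard Python library is enough and should be faster as well.
# latin-1 maps every byte to one character, so reading and writing with it
# keeps the other bytes unchanged (assumes an ASCII-compatible encoding)
with open("test.csv", "r", encoding="latin-1", newline="") as f:
    lines = f.readlines()

# drop every double-quote character from each line
output = [line.replace('"', '') for line in lines]

with open("output.csv", "w", encoding="latin-1", newline="") as f:
    f.writelines(output)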
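Deleting every quote character can shift columns when a quoted field itself contains a comma. As a hedged alternative (not part of the answer above), the standard csv module can parse the quoted fields and write them back with quotes only where they are still needed. The function name and the sample rows are illustrative.

```python
import csv
import io


def strip_redundant_quotes(text: str) -> str:
    """Rewrite CSV text so only fields that need quoting keep their quotes."""
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    # QUOTE_MINIMAL quotes a field only if it contains a comma, quote or newline.
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


sample = '"Firstname","Lastname","CellNumber"\n"Ann","Lee, Jr.","555-0100"\n'
cleaned = strip_redundant_quotes(sample)
print(cleaned)
```

The field "Lee, Jr." keeps its quotes because it contains a comma, so it still counts as a single column when the file is loaded.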

Encoding discrepancy with Iris Dataset

After I downloaded the dataset as iris.data, I renamed it to iris.data.txt. I was trying to circumvent this reported error on SO:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 8: invalid continuation byte
After reading up, I tried this:
dataset = pd.read_csv('iris.data.txt', header=None, names=names,encoding="ISO-8859-1")
This partly solved the error but some rows were still garbage.
Then I tried to open it with Sublime, save it with utf-8 encoding and then dataset = pd.read_csv('iris.data.txt', header=None, names=names,encoding="utf-8")
But this doesn't solve the problem either. I'm running Python 3 on Mac OS. What could possibly render the data readable directly?
[EDIT]:
The datatype reads: Web archive. In Spyder, the file appears as iris.data.webarchive
If I try dataset = pd.read_csv('iris.data.webarchive', header=None), it gives this traceback:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 5
If I try dataset = pd.read_csv('iris.data', header=None), it gives FileNotFoundError: File b'iris.data' does not exist
I figured out my rookie mistake. I had to save the page as 'source' instead of 'webarchive' (which is the default Mac setting)

Codec issues in pandas read_csv

I have two text files:
https://www.dropbox.com/s/idk7k5qv2mp3d4p/bad.txt?dl=0
https://www.dropbox.com/s/x27fngacngaglyy/good.txt?dl=0
Hex editor shows bad.txt begins: "FF FE 53 00 79" and Notepad++ reports the file is UCS-2 LE BOM. I believe utf_16_le should decode this, but the following code errors with UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x53 in position 2: truncated data:
import pandas as pd
df1 = pd.read_csv("good.txt")
df2 = pd.read_csv("bad.txt", encoding="utf_16_le")
I have tried every codec I can find, but cannot get pandas to read bad.txt. I have many files like this to read in an automated context. Two questions:
Is something "wrong" with bad.txt? Is the program generating the file somehow mishandling the file?
How can I read this into a pandas df? If necessary, can I convert the file with python code? The data seems to be fine since many other programs (text editors, excel, etc.) can interpret it, but how do I get pandas to play nicely?
Update: pandas 0.20 handles this file with the utf-16 codec, as expected. Thank you to those who looked at it.
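The difference between the two codec names can be seen without pandas. This sketch uses a made-up sample string rather than the contents of bad.txt, and it does not reproduce the "truncated data" error itself. It only shows that utf-16 reads the FF FE byte order mark to pick the byte order and then drops it, while utf_16_le assumes the byte order and leaves the mark in the text as U+FEFF.

```python
import codecs

# Made-up content; only the leading FF FE mark mirrors the hex dump above.
sample = "System,Value\n"
raw = codecs.BOM_UTF16_LE + sample.encode("utf_16_le")

# First five bytes: FF FE 53 00 79
assert raw[:5] == b"\xff\xfeS\x00y"

with_bom_codec = raw.decode("utf-16")     # BOM is consumed
with_le_codec = raw.decode("utf_16_le")   # BOM survives as '\ufeff'

print(repr(with_bom_codec))
print(repr(with_le_codec))
```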

Error reading csv file using pandas [duplicate]

This question already has answers here:
UnicodeDecodeError when reading CSV file in Pandas with Python
I am trying to read a CSV into a DataFrame, change one column, write the updated values back to the same CSV (with to_csv), and then read that CSV again into another DataFrame. At that last step I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte
my code is
import pandas as pd
df = pd.read_csv("D:\ss.csv")
df.columns #o/p is Index(['CUSTOMER_MAILID', 'False', 'True'], dtype='object')
df['True'] = df['True'] + 2 #making changes to one column of type float
df.to_csv("D:\ss.csv") #updating that .csv
df1 = pd.read_csv("D:\ss.csv") #again trying to read that csv
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte
Please suggest how I can avoid the error and read that CSV back into a DataFrame.
I know I am probably missing something like "encode = some codec type" or "decode = some type" when reading and writing the CSV,
but I don't know exactly what should be changed, so I need help.
Known encoding
If you know the encoding of the file you want to read in,
you can use
pd.read_csv('filename.txt', encoding='encoding')
These are the possible encodings:
https://docs.python.org/3/library/codecs.html#standard-encodings
Unknown encoding
If you do not know the encoding, you can try chardet; however, it is not guaranteed to work, since it only makes an educated guess.
import chardet
import pandas as pd
with open('filename.csv', 'rb') as f:
    result = chardet.detect(f.read())  # or readline if the file is large
pd.read_csv('filename.csv', encoding=result['encoding'])
Is that error happening on your first read of the data, or on the second read after you write it out and read it back in again? My guess is that it's actually happening on the first read of the data, because your CSV has an encoding that isn't UTF-8.
Try opening that CSV file in Notepad++, or Excel, or LibreOffice. Does your data source have the ç (C with cedilla) character in it? If it does, then that 0xE7 byte you're seeing is probably the ç encoded in either Latin-1 or Windows-1252 (called "cp1252" in Python).
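That guess about the 0xE7 byte is easy to check in plain Python. A small sketch, using a two-byte sample made up for illustration:

```python
byte = b"\xe7"

# In both single-byte encodings, 0xE7 is the letter c with cedilla.
latin1_char = byte.decode("latin-1")
cp1252_char = byte.decode("cp1252")

# In UTF-8, 0xE7 starts a three-byte sequence, so an ASCII letter
# right after it is rejected.
try:
    b"\xe7a".decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc.reason

print(latin1_char, cp1252_char, utf8_error)
```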
Looking at the documentation for the Pandas read_csv() function, I see it has an encoding parameter, which should be the name of the encoding you expect that CSV file to be in. So try adding encoding="cp1252" to your read_csv() call, as follows:
df = pd.read_csv(r"D:\ss.csv", encoding="cp1252")
Note that I added the character r in front of the filename, so that it will be considered a "raw string" and backslashes won't be treated specially. That way you don't get a surprise when you change the filename from ss.csv to new-ss.csv, where the string D:\new-ss.csv would be read as D, :, newline character, e, w, etc.
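To see the difference concretely, here is a short sketch comparing the two spellings of the hypothetical new-ss.csv path:

```python
plain = "D:\new-ss.csv"   # \n is read as a newline character
raw = r"D:\new-ss.csv"    # the backslash is kept literally

print(repr(plain))
print(repr(raw))
print(len(plain), len(raw))
```

The plain string comes out one character shorter, because the backslash and the n collapse into a single newline.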
Anyway, try that encoding parameter on your first read_csv() call and see if it works. (It's only a guess, since I don't know your actual data. If the data file isn't private and isn't too large, try posting the data file so we can see its contents -- that would let us do better than just guessing.)
One simple solution is to open the CSV file in an editor like Sublime Text and save it with UTF-8 encoding. Then pandas can read the file easily.
The approach above, importing chardet and detecting the file's encoding before reading it, works:
import chardet
import pandas as pd
with open('filename.csv', 'rb') as f:
    result = chardet.detect(f.read())  # or readline if the file is large
pd.read_csv('filename.csv', encoding=result['encoding'])
Yes, you'll get this error. I worked around the problem by opening the CSV file in Notepad++, choosing Encoding menu -> Convert to UTF-8, and saving the file. Then I ran the Python program on it again.
Another solution is to use Python's codecs module to encode and decode files. I haven't used that.
I am new to Python. I ran into this exact issue when I manually changed the extension of my Excel file to .csv and tried to read it with read_csv. However, when I opened the Excel file and saved it as a CSV file instead, it seemed to work.

Encoding in pandas

I have a CSV file that is encoded in GB2312. I successfully read it into a pandas DataFrame with the option encoding = 'GB2312'. However, after I opened the file in Stata (and did quite a bit of manual editing) and saved it back as CSV, I could no longer open it in pandas. I got the following error message:
'gb2312' codec can't decode byte 0xcf in position 2044: incomplete multibyte sequence
So it seems there were some characters in the file that cannot be decoded (indeed, I can read in the first couple of lines with no problem). Python has an 'ignore' option for decoding strings, but I don't know how to apply that option to read_csv.
Any help is appreciated.
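One possible way to get the 'ignore' behaviour, sketched under the assumption that losing the undecodable bytes is acceptable: decode the bytes yourself with errors="ignore" and hand the resulting text to read_csv. The sample data below is made up; it is valid GB2312 followed by one stray lead byte.

```python
import io

import pandas as pd

# Made-up sample: valid GB2312 text with a dangling lead byte at the end.
good = "name,city\n张三,北京\n".encode("gb2312")
damaged = good + b"\xcf"

# Strict decoding fails on the dangling byte.
try:
    damaged.decode("gb2312")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="ignore" silently drops bytes that cannot be decoded.
text = damaged.decode("gb2312", errors="ignore")
df = pd.read_csv(io.StringIO(text))
print(df)
```

For a file on disk, the same idea is to pass open(path, encoding="gb2312", errors="ignore") to pd.read_csv; pandas 1.3 and later also accept encoding_errors="ignore" directly. Dropped bytes mean lost characters, so it is worth checking the affected rows afterwards.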
