Codec issues in pandas read_csv - python

I have two text files:
https://www.dropbox.com/s/idk7k5qv2mp3d4p/bad.txt?dl=0
https://www.dropbox.com/s/x27fngacngaglyy/good.txt?dl=0
A hex editor shows that bad.txt begins with FF FE 53 00 79, and Notepad++ reports the file as UCS-2 LE BOM. I believe utf_16_le should decode this, but the following code fails with UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x53 in position 2: truncated data:
import pandas as pd
df1 = pd.read_csv("good.txt")
df2 = pd.read_csv("bad.txt", encoding="utf_16_le")
I have tried every codec I can find, but cannot get pandas to read bad.txt. I have many files like this to read in an automated context. Two questions:
Is something "wrong" with bad.txt? Is the program that generates the file somehow mishandling it?
How can I read this into a pandas DataFrame? If necessary, can I convert the file with Python code? The data itself seems fine, since many other programs (text editors, Excel, etc.) can interpret it, so how do I get pandas to play nicely?

Update: pandas 0.20 handles this file with the utf-16 codec, as expected. Thanks to those who looked at it.
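For anyone landing here later, a minimal sketch of the working call: the utf-16 codec (no -le/-be suffix) consumes the FF FE BOM and infers the byte order itself, whereas utf_16_le leaves the BOM in the decoded data:
import pandas as pd
df2 = pd.read_csv("bad.txt", encoding="utf-16")  # "utf-16" strips the BOM automatically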

Related

UnicodeDecodeError when reading CSV file in Pandas with Python "'utf-8' codec can't decode byte 0xff in position 0: invalid start byte"

I am having trouble reading a csv file using read_csv in Pandas. Here's the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
I have tried a bunch of different encoding types with the file I am dealing with and none seem to work. The file is from Google's Search Ads 360 product, which says the csv should be in the 'UTF-16' format. Strangely, if I open the file in Excel and save it in UTF-8 format, I can use read_csv normally.
I've tried the solutions to a similar problem here, but they did not work for me. This is the only code I am running:
import pandas as pd
df = pd.read_csv('path/file.csv')
Edit: I read the file in as tab-delimited, and that seemed to work. I still don't understand why I got the error I did when I tried to read it in as a normal CSV. Any insight into this would be appreciated!
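For what it's worth, 0xff in position 0 is typically the first byte of a UTF-16 little-endian BOM (FF FE), which would fit the product's stated UTF-16 output and also explain why re-saving from Excel as UTF-8 made the error go away. A sketch, assuming the export really is UTF-16 and tab-delimited:
import pandas as pd
# assumption: the Search Ads 360 export is UTF-16 with a BOM and tab-separated
df = pd.read_csv('path/file.csv', sep='\t', encoding='utf-16')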
Try this encoding:
import pandas as pd
df = pd.read_csv('path/file.csv',encoding='cp1252')

Read a file in python having rogue byte 0xc0 that causes utf-8 and ascii to error out

Trying to read a tab-separated file into pandas dataframe:
>>> df = pd.read_table(fn , na_filter=False, error_bad_lines=False)
It errors out like so:
b'Skipping line 58: expected 11 fields, saw 12\n'
Traceback (most recent call last):
...(many lines)...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 115: invalid start byte
It seems the byte 0xc0 causes trouble under both the utf-8 and ascii codecs.
>>> df = pd.read_table(fn , na_filter=False, error_bad_lines=False, encoding='ascii')
...(many lines)...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 115: ordinal not in range(128)
I ran into the same issue with the csv module's reader too.
If I import the file into OpenOffice Calc, it gets imported properly and the columns are correctly recognized; presumably the offending 0xc0 byte is ignored there. It is not some vital piece of the data; it's probably just a fluke write error by the system that generated this file. I'll be happy to even zap the line where this occurs if it comes to that. I just want to read the file into the Python program. The error_bad_lines=False option of pandas ought to have taken care of this problem, but no dice. Also, the file does NOT have any content in non-English scripts that would make Unicode necessary; it's all standard English letters and numbers. I tried utf-16, utf-32, etc. too, but they only caused more errors of their own.
How to make python (pandas Dataframe in particular) read a file having one or more rogue byte 0xc0 characters?
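Note that the decode error is raised while the raw bytes are being turned into text, before any line parsing happens, which is why error_bad_lines=False cannot catch it. A hedged sketch that simply zaps the rogue byte before pandas ever sees it (reusing fn from the question, and assuming, as stated above, that the byte carries no data):
import io
import pandas as pd
# strip the rogue 0xc0 byte at the byte level, then parse the cleaned buffer
with open(fn, 'rb') as f:
    cleaned = f.read().replace(b'\xc0', b'')
df = pd.read_table(io.BytesIO(cleaned), na_filter=False)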
Moving this answer here from another place where it got a hostile reception.
Found one encoding that actually accepts (meaning, doesn't error out on) byte 0xc0:
encoding="ISO-8859-1"
Note: this entails making sure the rest of the file doesn't contain genuinely multi-byte (e.g. UTF-8) text, since ISO-8859-1 will misread such sequences as mojibake rather than raising an error. This may be helpful for folks like me who didn't have any non-ASCII content in their file anyway and just wanted Python to load the damn thing when both the utf-8 and ascii codecs were erroring out.
More on ISO-8859-1: What is the difference between UTF-8 and ISO-8859-1?
New command that works:
>>> df = pd.read_table(fn , na_filter=False, error_bad_lines=False, encoding='ISO-8859-1')
After reading it in, the dataframe is fine; the columns and data all work like they did in OpenOffice Calc. I still have no idea where the offending 0xc0 byte went, but it doesn't matter as I've got the data I needed.
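As for where the offending byte went: ISO-8859-1 assigns a character to all 256 byte values, so nothing is dropped; 0xc0 simply decodes to 'À' and is sitting somewhere in the loaded data. A quick check:
print(b"\xc0".decode("ISO-8859-1"))               # -> 'À' (never raises)
print(b"\xc0".decode("utf-8", errors="replace"))  # -> '�' (invalid UTF-8 start byte)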

Python pandas load csv ANSI Format as UTF-8

I want to load a CSV file with pandas in Jupyter Notebooks that contains characters like ä, ö, ü, ß.
When I open the CSV file with Notepad++, here is one example row which causes trouble in ANSI format:
Empf„nger;Empf„ngerStadt;Empf„ngerStraáe;Empf„ngerHausnr.;Empf„ngerPLZ;Empf„ngerLand
The correct UTF-8 outcome for Empf„nger should be: Empfänger
Now, when I load the CSV data with pandas under Python 3.6 on Windows using the following code:
df_a = pd.read_csv('file.csv',sep=';',encoding='utf-8')
I get an error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position xy: invalid continuation byte
('xy' is the position of the character that causes the error.)
When I load the CSV file in ANSI format instead, it works, but the umlauts are displayed incorrectly.
Example code:
df_a = pd.read_csv('afile.csv',sep=';',encoding='ANSI')
Empfänger is represented as: Empf„nger
Note: I have tried to convert the file to UTF-8 in Notepad++ and load it afterwards with pandas, but I still get the same error.
I have searched online for a solution, but suggestions such as "change the format in Notepad++ to UTF-8", or encoding='UTF-8', or 'latin1' (which gives me the same result as the ANSI format), or
import chardet
with open('afile.csv', 'rb') as f:
    result = chardet.detect(f.readline())
df_a = pd.read_csv('afile.csv', sep=';', encoding=result['encoding'])
didn't work for me.
encoding='cp1252'
throws the following exception:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2: character maps to <undefined>
I also tried to replace strings afterwards with the x.replace() method, but the character ü disappears completely once loaded into a pandas DataFrame.
If you don't know your file's encoding, the fastest approach is to open the file in a text editor like Notepad++ and check how the file is encoded.
Then go to the Python documentation and look up the matching codec.
In your case, ANSI, the codec is 'mbcs' (the Windows default code page; this codec exists only on Windows), so your code would look like this:
df_a = pd.read_csv('file.csv',sep=';',encoding='mbcs')
When EmpfängerStraße shows up as Empf„ngerStraáe when decoded as "ANSI" (more precisely cp1250 in this case), the actual encoding of the data is most likely cp850:
print 'Empf„ngerStraáe'.decode('utf8').encode('cp1250').decode('cp850')
Or Python 3, where literal strings are already unicode strings:
print("Empf„ngerStraáe".encode("cp1250").decode("cp850"))
I couldn't find a proper solution after trying all the well-known encodings, from ISO-8859-1 to ISO-8859-15, from UTF-8 to UTF-32, and Windows-1250 through Windows-1258; nothing worked properly. So my guess is that the text encoding got corrupted during the export. My own workaround is to load the text file into a DataFrame with Windows-1251, as it does not cut out the special characters in my text file, and then replace all broken characters with the corresponding correct ones. It's a rather unsatisfying solution that takes a lot of time to compute, but it's better than nothing.
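A minimal sketch of that load-then-repair approach; the replacement pairs below are purely illustrative (the actual broken-to-correct mapping depends on how the export was corrupted):
import pandas as pd
# load with the encoding that tolerated this particular file, then patch mojibake
df = pd.read_csv('file.csv', sep=';', encoding='windows-1251')
fixes = {'„': 'ä', 'á': 'ß'}  # hypothetical broken -> correct pairs
for broken, correct in fixes.items():
    df = df.replace(broken, correct, regex=True)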
You could use the encoding value UTF-16LE to solve the problem:
pd.read_csv("./file.csv", encoding="UTF-16LE")
For this to work, file.csv should have been saved with UTF-16LE encoding (in Notepad++, the option labelled UCS-2 LE BOM).
cp1252 works on both Linux and Windows to decode latin1-encoded files.
df = pd.read_csv('data.csv',sep=';',encoding='cp1252')
However, if you are running on a Windows machine, I would recommend using
df = pd.read_csv('data.csv', sep=';', encoding='mbcs')
Ironically, using 'latin1' as the encoding does not always work, especially if you want to convert the file to a different encoding.

Error reading csv file using pandas [duplicate]

This question already has answers here:
UnicodeDecodeError when reading CSV file in Pandas with Python
What I am trying to do is: read a CSV into a dataframe, make changes in a column, write the changed values back to the same CSV (to_csv), and then read that CSV again into another dataframe. On that second read I get an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte
my code is
import pandas as pd
df = pd.read_csv("D:\ss.csv")
df.columns #o/p is Index(['CUSTOMER_MAILID', 'False', 'True'], dtype='object')
df['True'] = df['True'] + 2 #making changes to one column of type float
df.to_csv("D:\ss.csv") #updating that .csv
df1 = pd.read_csv("D:\ss.csv") #again trying to read that csv
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte
So please suggest how I can avoid the error and read that CSV into a dataframe again.
I know I am missing an "encoding=some codec" option somewhere while reading and writing the CSV, but I don't know what exactly should be changed, so I need help.
Known encoding
If you know the encoding of the file you want to read in,
you can use
pd.read_csv('filename.txt', encoding='encoding')
These are the possible encodings:
https://docs.python.org/3/library/codecs.html#standard-encodings
Unknown encoding
If you do not know the encoding, you can try chardet; however, this is not guaranteed to work, as it is more guesswork than detection.
import chardet
import pandas as pd

with open('filename.csv', 'rb') as f:
    result = chardet.detect(f.read())  # or f.readline() if the file is large

pd.read_csv('filename.csv', encoding=result['encoding'])
Is that error happening on your first read of the data, or on the second read after you write it out and read it back in again? My guess is that it's actually happening on the first read of the data, because your CSV has an encoding that isn't UTF-8.
Try opening that CSV file in Notepad++, or Excel, or LibreOffice. Does your data source have the ç (C with cedilla) character in it? If it does, then that 0xE7 byte you're seeing is probably the ç encoded in either Latin-1 or Windows-1252 (called "cp1252" in Python).
Looking at the documentation for the Pandas read_csv() function, I see it has an encoding parameter, which should be the name of the encoding you expect that CSV file to be in. So try adding encoding="cp1252" to your read_csv() call, as follows:
df = pd.read_csv(r"D:\ss.csv", encoding="cp1252")
Note that I added the character r in front of the filename, so that it will be considered a "raw string" and backslashes won't be treated specially. That way you don't get a surprise when you change the filename from ss.csv to new-ss.csv, where the string D:\new-ss.csv would be read as D, :, newline character, e, w, etc.
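A quick demonstration of the difference:
print("D:\new-ss.csv")   # \n is interpreted as a newline, splitting the path
print(r"D:\new-ss.csv")  # raw string: prints D:\new-ss.csv unchanged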
Anyway, try that encoding parameter on your first read_csv() call and see if it works. (It's only a guess, since I don't know your actual data. If the data file isn't private and isn't too large, try posting the data file so we can see its contents -- that would let us do better than just guessing.)
One simple solution is to open the CSV file in an editor like Sublime Text and save it with 'utf-8' encoding. Then pandas can read the file easily.
Yes, you'll get this error. I worked around this problem by opening the CSV file in Notepad++ and changing the encoding via the Encoding menu -> Convert to UTF-8, then saving the file and running the Python program over it again.
Another solution is to use the codecs module in Python for encoding and decoding files. I haven't used that myself.
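For completeness, a minimal sketch of that codecs-based route, transcoding the file to UTF-8 once so that every later read_csv works unmodified; the source encoding cp1252 follows the earlier answer's suggestion, and the output name ss-utf8.csv is made up:
import codecs

# one-off transcode: read as cp1252 (assumed), write back out as UTF-8
with codecs.open(r"D:\ss.csv", "r", encoding="cp1252") as src, \
        codecs.open(r"D:\ss-utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(src.read())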
I am new to Python. I ran into this exact issue when I manually changed the extension of my Excel file to .csv and tried to read it with read_csv. However, if I opened the Excel file and saved it as a CSV file instead, it seemed to work.

Encoding in pandas

I have a CSV file encoded in GB2312. I successfully read it into a pandas DataFrame with the option encoding='GB2312'. However, after I opened the file in Stata (and did quite a bit of manual editing) and saved it back as CSV, I could no longer open it in pandas. I got the following error message:
'gb2312' codec can't decode byte 0xcf in position 2044: incomplete multibyte sequence
So it seems there are some characters in the file that cannot be decoded (indeed, I can read the first couple of lines with no problem). Python's decoder has an 'ignore' option for error handling, but I don't know how to apply that option to read_csv.
Any help is appreciated.
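For what it's worth, one way to get a decode-time errors='ignore' into read_csv is to open the file yourself and hand pandas the already-decoded stream; newer pandas (1.3+) also exposes this directly via the encoding_errors parameter. A minimal sketch, assuming the file is named data.csv:
import pandas as pd

# decode with errors='ignore' ourselves, then let pandas parse the text stream
with open('data.csv', encoding='gb2312', errors='ignore') as f:
    df = pd.read_csv(f)

# on pandas >= 1.3 this is built in:
# df = pd.read_csv('data.csv', encoding='gb2312', encoding_errors='ignore')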
