Encoding in pandas - python

I have a CSV file encoded in GB2312. I successfully read it into a pandas DataFrame with the option encoding='GB2312'. However, after I opened the file in Stata (and did quite a bit of manual editing) and saved it back to CSV, I could no longer open it in pandas. I got the following error message:
'gb2312' codec can't decode byte 0xcf in position 2044: incomplete multibyte sequence
So it seems there are some characters in the file that cannot be decoded (indeed, I can read the first couple of lines with no problem). Python has an 'ignore' option for decoding strings, but I don't know how to pass that option to read_csv.
Any help is appreciated.
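For reference, one way to get the 'ignore' behaviour asked about here, sketched under the assumption of a reasonably recent pandas (the encoding_errors keyword appeared in pandas 1.3); on older versions you can decode the file yourself and pass the handle to read_csv. The path 'my_data.csv' is only a placeholder.

import pandas as pd

# pandas 1.3+ exposes the codec error handler directly
df = pd.read_csv('my_data.csv', encoding='gb2312', encoding_errors='ignore')

# On older pandas, decode the file yourself with errors='ignore'
# and hand the already-decoded stream to read_csv
with open('my_data.csv', encoding='gb2312', errors='ignore') as f:
    df = pd.read_csv(f)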

Related

'charmap' codec can't encode character '\u0110' in position 1: character maps to <undefined> while reading csv file using pd.read_csv

I have searched Google a lot for why I am getting this error, but everywhere I look the solution is to use a different encoding like cp1252, iso-8859-1, latin1 or utf-8. I am actually using utf-8 and have tried all the other encodings too when calling pd.read_csv.
When I read the CSV on a different PC with the same encoding, it does not throw this error, so I think this is a fault with my local machine.
This is how I read my CSV:
dataframe = pd.read_csv(csv_path + file_name, dtype='object', encoding="UTF-8")
If there are any other characters like Arabic or Chinese, I get this error while reading the CSV file with pd.read_csv:
'charmap' codec can't encode character '\u0110' in position 1: character maps to <undefined>
I have looked at a lot of Stack Overflow answers and other sources, but none of them fixed my problem. The fault is with my local machine, so can anyone help me figure out where the problem lies?
Thank you

Special Character in Header line during text import

I'm trying to write a Python script to import a data file generated by data acquisition software (EC-lab). I would like to keep the column headers as they are in the file and not define them manually, since they are not uniform across all files (different techniques will generate data in different orders and with a different number of headers). The problem is that the header text in the file contains forward slashes (e.g. "ox/red", "time/s").
I am getting an ASCII error when I try to load the data with the header row:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 19: ordinal not in range(128)
I've tried adding encoding as a keyword argument, based on other solutions, but that didn't fix it:
data = np.genfromtxt("20180611_bB_GCE-G.mpt", dtype=None, delimiter='\t', names=True, skip_header=61, encoding='utf-8')
I'm currently using genfromtxt as the data import technique
data = np.genfromtxt("filename.mpt", dtype=None, delimiter='\t', names=True, skip_header=61)
First, forward slashes in headers are not a problem for ASCII, for CSV files, or for NumPy.
My guess is that the real problem is that your CSV is in Latin-1, or a Latin-1-compatible encoding like Windows-1252, and that one of the headers includes the micro sign µ, which is 0xB5 in those encodings. Or that the headers aren't actually a problem at all, and you have µ characters in some of the data.
Either way, with the default encoding of ASCII, you get an error about 0xb5 not being in range(128), exactly like the one in your question.
If you try to fix this by explicitly specifying encoding='utf-8', that's the wrong encoding, and you just get a different error, about 0xb5 being an invalid start byte.
If you fix it by specifying encoding='latin-1', it should work.
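Concretely, that would be the same genfromtxt call as in the question with the encoding swapped, assuming the file really is Latin-1 (or a compatible code page such as Windows-1252):

import numpy as np

# same call as in the question, with the encoding changed to latin-1
data = np.genfromtxt("20180611_bB_GCE-G.mpt", dtype=None, delimiter='\t',
                     names=True, skip_header=61, encoding='latin-1')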
More generally, you have to know what encoding your files are actually in, not just guess wildly. Especially if you're on Windows, where a lot of files are going to be in whatever encoding you have set as your OEM code page, while others will be in UTF-16-LE, while others will be in UTF-8 but with an illegal BOM, etc.
The program that generated them should document what encoding it uses, or have options to let you pick. If it doesn't, you need to try, e.g., viewing the file in a text editor that lets you select the encoding to try to figure out which one looks correct. Or you can use a tool like chardet to help you guess.
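As a rough sketch of the chardet route (chardet only guesses, so treat the result as a starting point rather than an answer):

import chardet

# feed chardet a chunk of raw bytes and look at its best guess
with open("20180611_bB_GCE-G.mpt", "rb") as f:
    guess = chardet.detect(f.read(100000))

print(guess)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}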

Read a file in python having rogue byte 0xc0 that causes utf-8 and ascii to error out

Trying to read a tab-separated file into a pandas DataFrame:
>>> df = pd.read_table(fn , na_filter=False, error_bad_lines=False)
It errors out like so:
b'Skipping line 58: expected 11 fields, saw 12\n'
Traceback (most recent call last):
...(many lines)...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 115: invalid start byte
It seems the byte 0xc0 causes trouble with both the utf-8 and ascii encodings.
>>> df = pd.read_table(fn , na_filter=False, error_bad_lines=False, encoding='ascii')
...(many lines)...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 115: ordinal not in range(128)
I ran into the same issues with csv module's reader too.
If I import the file into OpenOffice Calc, it gets imported properly and the columns are recognized correctly, so the offending 0xc0 byte is probably just ignored there. It is not some vital piece of the data; it's probably just a fluke write error by the system that generated this file. I'll be happy to just zap the line where this occurs if it comes to that. I just want to read the file into the Python program. The error_bad_lines=False option of pandas ought to have taken care of this problem, but no dice. Also, the file does NOT have any content in non-English scripts that would make Unicode necessary; it's all standard English letters and numbers. I tried utf-16, utf-32, etc. too, but they only caused more errors of their own.
How can I make Python (a pandas DataFrame in particular) read a file that has one or more rogue 0xc0 bytes?
Moving this answer here from another place where it got a hostile reception.
Found one encoding that actually accepts (meaning, doesn't error out on) byte 0xc0:
encoding="ISO-8859-1"
Note: this entails making sure the rest of the file doesn't have Unicode characters. It may be helpful for folks like me who didn't have any Unicode characters in their file anyway and just wanted Python to load the thing when both the utf-8 and ascii encodings were erroring out.
More on ISO-8859-1: What is the difference between UTF-8 and ISO-8859-1?
New command that works:
>>> df = pd.read_table(fn , na_filter=False, error_bad_lines=False, encoding='ISO-8859-1')
After reading it in, the DataFrame is fine; the columns and data all work like they did in OpenOffice Calc. I still have no idea where the offending 0xc0 byte went, but it doesn't matter, as I've got the data I needed.
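If you would rather zap the rogue byte itself, as the question suggests, a minimal sketch (assuming 0xc0 really is just noise and carries no data) is to clean the raw bytes before handing them to pandas. The bad-lines flag is left out here; add back whichever form your pandas version supports.

import io
import pandas as pd

fn = "data.tsv"  # placeholder for the same path variable used in the question

# read the raw bytes, drop the stray 0xc0 bytes, then parse as usual
with open(fn, 'rb') as f:
    cleaned = f.read().replace(b'\xc0', b'')

df = pd.read_table(io.BytesIO(cleaned), na_filter=False)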

Python pandas load csv ANSI Format as UTF-8

I want to load a CSV file with pandas in a Jupyter notebook; the file contains characters like ä, ö, ü, ß.
When I open the CSV file with Notepad++, here is one example row which causes trouble in ANSI format:
Empf„nger;Empf„ngerStadt;Empf„ngerStraáe;Empf„ngerHausnr.;Empf„ngerPLZ;Empf„ngerLand
The correct UTF-8 outcome for Empf„nger should be: Empfänger
Now, when I load the CSV data with pandas under Python 3.6 on Windows using the following code:
df_a = pd.read_csv('file.csv',sep=';',encoding='utf-8')
I get an error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position xy: invalid continuation byte
Position 'xy' is the position where the offending character occurs.
When I use the ANSI format to load my CSV file, it works, but the umlauts are displayed incorrectly.
Example code:
df_a = pd.read_csv('afile.csv',sep=';',encoding='ANSI')
Empfänger is represented as: Empf„nger
Note: I have tried converting the file to UTF-8 in Notepad++ and loading it afterwards with pandas, but I still get the same error.
I have searched online for a solution, but the suggested fixes, such as "change the format in Notepad++ to UTF-8", "use encoding='UTF-8'", 'latin1' (which gives me the same result as the ANSI format), or
import chardet
with open('afile.csv', 'rb') as f:
    result = chardet.detect(f.readline())
df_a = pd.read_csv('afile.csv',sep=';',encoding=result['encoding'])
didn't work for me.
encoding='cp1252'
throws the following exception:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2: character maps to <undefined>
I also tried to replace the strings afterwards with the .replace() method, but the character ü disappears completely once loaded into a pandas DataFrame.
If you don't know your file's encoding, I think the fastest approach is to open the file in a text editor like Notepad++ to check how the file is encoded.
Then go to the Python documentation and look for the correct codec to use.
In your case, ANSI, the codec is 'mbcs', so your code will look like this:
df_a = pd.read_csv('file.csv',sep=';',encoding='mbcs')
When EmpfängerStraße shows up as Empf„ngerStraáe when decoded as "ANSI" (more correctly cp1250 in this case), then the actual encoding of the data is most likely cp850:
print 'Empf„ngerStraáe'.decode('utf8').encode('cp1250').decode('cp850')
Or Python 3, where literal strings are already unicode strings:
print("Empf„ngerStraáe".encode("cp1250").decode("cp850"))
I couldn't find a proper solution after trying all the well-known encodings, from ISO-8859-1 to ISO-8859-15, from UTF-8 to UTF-32, and Windows-1250 through 1258, and nothing worked properly. So my guess is that the text encoding got corrupted during the export. My own solution is to load the text file into a DataFrame with Windows-1251, as it does not cut out special characters in my text file, and then replace all the broken characters with the corresponding ones. It's a rather unsatisfying solution that takes a lot of time to compute, but it's better than nothing.
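A sketch of that load-then-replace approach; the mapping below is purely illustrative (it reuses the Empf„nger to Empfänger example from the question), and the broken sequences you actually see will depend on the codec you load with:

import pandas as pd

df = pd.read_csv('file.csv', sep=';', encoding='windows-1251')

# illustrative mapping from broken sequences to the intended characters
fixes = {'„': 'ä'}

for broken, fixed in fixes.items():
    df = df.replace(broken, fixed, regex=True)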
You could use the encoding value UTF-16LE to solve the problem
pd.read_csv("./file.csv", encoding="UTF-16LE")
The file.csv should be saved using the UTF-16LE encoding in Notepad++ (option 'UCS-2 LE BOM').
cp1252 works on both Linux and Windows to decode latin1-encoded files.
df = pd.read_csv('data.csv',sep=';',encoding='cp1252')
Although, if you are running on a Windows machine, I would recommend using
df = pd.read_csv('data.csv', sep=';', encoding='mbcs')
Ironically, using 'latin1' as the encoding does not always work, especially if you want to convert the file to a different encoding.

Python encoding issue while reading a file

I am trying to read a file that contains the character "ë". The problem is that I cannot figure out how to read it, no matter what I try with the encoding. When I manually look at the file in TextEdit, it is listed as an unknown 8-bit file. If I try changing it to utf-8, utf-16, or anything else, it either does not work or messes up the entire file. I tried reading the file with standard Python calls as well as with codecs, and I cannot come up with anything that will read it correctly. I will include a code sample of the read below. Does anyone have any clue what I am doing wrong? This is Python 2.7.10, by the way.
readFile = codecs.open("FileName",encoding='utf-8')
The line I am trying to read is this with nothing else in it.
Aeëtes
Here are some of the errors I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x91 in position 0: invalid start byte
UnicodeError: UTF-16 stream does not start with BOM -- I know this one just means it is not a UTF-16 file.
UnicodeDecodeError: 'ascii' codec can't decode byte 0x91 in position 0: ordinal not in range(128)
If I don't use a codec, the word comes in as Ae?tes, which then crashes the program later. Just to be clear, none of the suggested questions, or any others anywhere on the net, have pointed to an answer. One other detail that might help is that I am using OS X, not Windows.
Credit for this answer goes to RadLexus for figuring out the proper encoding, and also to Mad Physicist, who pointed me in the right direction even if I did not consider all possible encodings.
The issue is apparently that a Mac will save the .txt file as mac_roman. If you use that encoding, it will work perfectly.
This is the line of code that I used:
readFile = codecs.open("FileName",encoding='mac_roman')
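For reference, on Python 3 the built-in open takes the same encoding argument, so codecs is no longer needed (a sketch):

# Python 3: no codecs module required
with open("FileName", encoding="mac_roman") as readFile:
    text = readFile.read()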
