Error while importing CSV in Python using pandas

I have started to learn Python for data science. I already use R on an almost daily basis, but I'm stuck on the first step: importing a CSV file using pandas' read_csv method. I have a problem with the file's encoding while importing.
If I use read.csv2 from R, everything is fine:
df <- read.csv2("some_path/myfile.txt", stringsAsFactors = FALSE, encoding = 'UTF-8')
but if I use similar code in Python:
import pandas as pd
df = pd.read_csv("some_path/myfile.txt", sep = ';', encoding= 'utf8')
it returns an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc6 in position 13: invalid continuation byte
How is it possible that I can import a file with "utf-8" encoding in R, but not in Python?
If I use a different encoding (latin1 or iso-8859-1), it imports the file successfully, but the characters are not decoded correctly.

Even though I don't understand why UTF-8 works in R but not in Python, I found out that the cp1250 encoding works fine.
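A minimal sketch of that fix, reusing the path from the question:

import pandas as pd

# cp1250 is the Central European Windows code page that worked here
df = pd.read_csv("some_path/myfile.txt", sep=';', encoding='cp1250')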

Use encoding "UTF-16". I used that to resolve my issue with the same error.

Related

Pandas read_csv filepath with special characters codec can't decode

I am using Python version 3.5.3 and Pandas version 0.20.1
I use read_csv to read in CSV files, using a file pointer as described in this post (I prefer this over the solution using _enablelegacywindowsfsencoding()). The following code works:
import pandas as pd
with open("C:/Desktop/folder/myfile.csv") as fp:
df=pd.read_csv(fp, sep=";", encoding ="latin")
This does work. However, when there is a special character like ä in the filename as follows:
import pandas as pd
with open("C:/Desktop/folderÄ/myfile.csv") as fp:
df=pd.read_csv(fp, sep=";", encoding ="latin")
Python displays an error message: (unicode error) 'utf-8' codec can't decode byte 0xc4 in position 0: unexpected end of data.
I also tried adding an r prefix before the file path, but I get the same error message, except that the reported position is now an integer offset that corresponds exactly to where the special character sits in the file path.
So the reason is the special character in the filepath name.
(It is not a decode error of the data that could be solved by using encoding="utf-8" or any other encoding like ISO-8859-1. To be absolutely sure, I tried the following encodings and always got the same error message: utf-8, ISO-8859-1, cp1252.)
The error indicates your source file (not the data file) is not encoded in UTF-8. In Python 3, your source file must either be saved in UTF-8 encoding, or you must declare the encoding that the source file is saved in with a special comment, e.g. #coding=Windows-1252 at the top of the file. \xc4 is the Windows-1252 encoding of Ä and is the default encoding for Western European and US Windows, so it's a good guess. Ideally, re-save your source in UTF-8.
For example, if the source is Windows-1252-encoded and the data file is GB2312-encoded (Chinese):
#coding=Windows-1252  # encoding of source file
import pandas as pd
with open('DÄTÄ.csv', encoding='gb2312') as f:  # encoding of data file
    data = pd.read_csv(f)
Note that source files default to UTF-8 encoding, but open defaults to the encoding returned by locale.getpreferredencoding(False). Since that varies with OS and configuration, it is best to always specify the encoding when opening files.
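As a quick sanity check, you can print that default yourself (a small sketch; the exact result depends on your OS and locale settings):

import locale

# this is what open() falls back to when no encoding argument is given
print(locale.getpreferredencoding(False))  # e.g. 'cp1252' on many Western Windows setups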
Try using a Unicode file path, u'path/to/files', for example:
import pandas as pd
with open(u'C:/Desktop/folderÄ/myfile.csv') as fp:
    df = pd.read_csv(fp, sep=";", encoding="latin")

Decode error with Pandas reading a .txt that only occurs on one computer

I have a comma-separated .txt file with French characters such as Vétérinaire and Désinfectant.
import pandas as pd
df = pd.read_csv('somefile.txt', sep=',', header=None, encoding='utf-8')
[Decode error - output not utf-8]
I have read many Q&A posts (including this one) and tried many different encodings such as 'latin1' and 'utf-16'; they didn't work. However, when I run the exact same script on a different Windows 10 computer with a similar Python setup (both Python 3.6), it works perfectly fine.
Edit: I tried this. Using encoding='cp1252' helps for some of the .txt files I want to import, but for a few .txt files, it gives the following error.
File "C:\Program_Files_Extra\Anaconda3\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 25: character maps to <undefined>
Edit: trying to identify the encoding with chardet:
import chardet
import pandas as pd
test_txt = 'somefile.txt'
rawdata = open(test_txt, 'rb').read()
result = chardet.detect(rawdata)
charenc = result['encoding']
print (charenc)
df = pd.read_csv(test_txt, sep=',', header=None, encoding=charenc)
print (df.head())
utf-8
[Decode error - output not utf-8]
Your program opens your file with a default encoding that doesn't match the actual contents of the file you are trying to open.
Option 1: Decode the file contents to Python string objects (note that binary mode 'rb' cannot be combined with an encoding argument, so the file must be opened in text mode):
rawdata = open(test_txt, 'r', encoding='utf-8').read()
Option 2: Open the CSV file in an editor like Sublime Text and save it with UTF-8 encoding, so that pandas can easily read the file.

Python: How to deal with replacement character � [duplicate]

Here is my code,
for line in open('u.item'):
    # Read each line
Whenever I run this code it gives the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte
I tried to solve this and add an extra parameter in open(). The code looks like:
for line in open('u.item', encoding='utf-8'):
    # Read each line
But again it gives the same error. What should I do then?
As suggested by Mark Ransom, I found the right encoding for this problem. The encoding was ISO-8859-1, so replacing open('u.item', encoding='utf-8') with open('u.item', encoding='ISO-8859-1') solves the problem.
The following also worked for me. ISO-8859-1 will save you a lot of trouble, especially when using speech recognition APIs.
Example:
file = open('../Resources/' + filename, 'r', encoding="ISO-8859-1")
Your file doesn't actually contain UTF-8 encoded data; it contains some other encoding. Figure out what that encoding is and use it in the open call.
In Windows-1252 encoding, for example, the 0xe9 would be the character é.
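You can verify that guess directly in the interpreter:

# 0xe9 decodes to 'é' in Windows-1252 (and in Latin-1, which shares this code point)
print(b'\xe9'.decode('windows-1252'))  # é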
Try this to read the file using pandas (m_cols here would be your list of column names):
pd.read_csv('u.item', sep='|', names=m_cols, encoding='latin-1')
This works:
open('filename', encoding='latin-1')
Or:
open('filename', encoding="ISO-8859-1")
If you are using Python 2, the following will be the solution:
import io
for line in io.open("u.item", encoding="ISO-8859-1"):
    # Do something
This is needed because in Python 2 the encoding parameter doesn't work with the built-in open(), and you would get the following error:
TypeError: 'encoding' is an invalid keyword argument for this function
You could resolve the problem with:
for line in open(your_file_path, 'rb'):
'rb' opens the file in binary mode.
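Note that in binary mode each line comes back as a bytes object, so you still have to decode it yourself at some point; a sketch, assuming the data is ISO-8859-1:

for line in open('u.item', 'rb'):
    text = line.decode('ISO-8859-1')  # bytes -> str, with the encoding that matches the file
    # process text here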
You can try this way:
open('u.item', encoding='utf8', errors='ignore')
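Be aware that errors='ignore' silently drops undecodable bytes. Using errors='replace' instead keeps a visible U+FFFD marker, which makes the data loss easier to spot:

# undecodable bytes become '�' instead of disappearing
for line in open('u.item', encoding='utf8', errors='replace'):
    print(line)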
Based on another Stack Overflow question and previous answers in this post, I would like to add a helper for finding the right encoding.
If your script runs on a Linux OS, you can get the encoding with the file command:
file --mime-encoding <filename>
Here is a python script to do that for you:
import sys
import subprocess
if len(sys.argv) < 2:
print("Usage: {} <filename>".format(sys.argv[0]))
sys.exit(1)
def find_encoding(fname):
"""Find the encoding of a file using file command
"""
# find fullname of file command
which_run = subprocess.run(['which', 'file'], stdout=subprocess.PIPE)
if which_run.returncode != 0:
print("Unable to find 'file' command ({})".format(which_run.returncode))
return None
file_cmd = which_run.stdout.decode().replace('\n', '')
# run file command to get MIME encoding
file_run = subprocess.run([file_cmd, '--mime-encoding', fname],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
if file_run.returncode != 0:
print(file_run.stderr.decode(), file=sys.stderr)
# return encoding name only
return file_run.stdout.decode().split()[1]
# test
print("Encoding of {}: {}".format(sys.argv[1], find_encoding(sys.argv[1])))
I was using a dataset downloaded from Kaggle; reading it threw this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 183: invalid continuation byte
So this is how I fixed it.
import pandas as pd
pd.read_csv('top50.csv', encoding='ISO-8859-1')
This is an example of converting a CSV file in Python 3:
import csv
from sys import argv

try:
    inputReader = csv.reader(open(argv[1], encoding='ISO-8859-1'),
                             delimiter=',', quotechar='"')
except IOError:
    pass
With the encoding replaced by encoding='ISO-8859-1', this works:
for line in open('u.item', encoding='ISO-8859-1'):
    print(line)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7044: invalid continuation byte
The above error occurs because of the file's encoding.
Solution: use encoding='latin-1'.
Reference: https://pandas.pydata.org/docs/search.html?q=encoding
Sometimes when you call open(filepath), and filepath is not actually a file, you can get the same error, so first make sure the file you're trying to open exists:
import os
assert os.path.isfile(filepath)
Open your file with Notepad++ and use the "Encoding" (or "Encodage") menu to identify its encoding, or to convert it from ANSI to UTF-8 or an ISO 8859-1 code page.
So that this page turns up quickly in a Google search for a similar question (about an error with UTF-8), I'm leaving my solution here for others.
I had a problem opening a .csv file, with this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 150: invalid continuation byte
I opened the file with Notepad and counted to the 150th position: it was a Cyrillic character. I re-saved the file with 'Save as...', choosing the UTF-8 encoding, and my program started to work.
I keep coming across this error, and often the problem is not solved by encoding='utf-8' but by engine='python', like this:
import pandas as pd
file = "c:\\path\\to_my\\file.csv"
df = pd.read_csv(file, engine='python')
df
A link to the docs is here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Use this if you are loading data directly from GitHub or Kaggle:
DF = pd.read_csv(file, encoding='ISO-8859-1')
In my case, this issue occurred because I had changed the extension of an Excel file (.xlsx) directly to .csv.
The solution was to open the file and save it as a new .csv file (i.e. File -> Save as -> select the .csv extension and save). This worked for me.
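The programmatic equivalent, if you'd rather not do it by hand, is a sketch like this (filenames are placeholders; reading .xlsx needs the openpyxl package installed):

import pandas as pd

# read the real Excel file, then write a genuine CSV instead of just renaming the extension
df = pd.read_excel('myfile.xlsx')
df.to_csv('myfile.csv', index=False, encoding='utf-8')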
My issue was similar in that UTF-8 text was getting passed to the Python script.
In my case, it was from SQL using the sp_execute_external_script in the Machine Learning service for SQL Server. For whatever reason, VARCHAR data appears to get passed as UTF-8, whereas NVARCHAR data gets passed as UTF-16.
Since there's no way to specify the default encoding in Python, and no user-editable Python statement parsing the data, I had to use the SQL CONVERT() function in my SELECT query in the @input_data_1 parameter.
So, while this query
EXEC sp_execute_external_script @language = N'Python',
@script = N'
OutputDataSet = InputDataSet
',
@input_data_1 = N'SELECT id, text FROM the_error;'
WITH RESULT SETS (([id] int, [text] nvarchar(max)));
gives the error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 0: unexpected end of data
Using CONVERT(type, data) (CAST(data AS type) would also work)
EXEC sp_execute_external_script @language = N'Python',
@script = N'
OutputDataSet = InputDataSet
',
@input_data_1 = N'SELECT id, CONVERT(NVARCHAR(max), text) FROM the_error;'
WITH RESULT SETS (([id] INT, [text] NVARCHAR(max)));
returns
id text
1 Ç

Python pandas load csv ANSI Format as UTF-8

I want to load a CSV file with pandas in Jupyter Notebooks; it contains characters like ä, ö, ü, ß.
When I open the CSV file with Notepad++, here is one example row which causes trouble, in ANSI format:
Empf„nger;Empf„ngerStadt;Empf„ngerStraáe;Empf„ngerHausnr.;Empf„ngerPLZ;Empf„ngerLand
The correct UTF-8 outcome for Empf„nger should be: Empfänger
Now, when I load the CSV data in Python 3.6 pandas on Windows with the following code:
df_a = pd.read_csv('file.csv',sep=';',encoding='utf-8')
I get an error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position xy: invalid continuation byte
(Position 'xy' is the position of the character that causes the error.)
When I use the ANSI format to load my CSV file, it works, but the umlauts are displayed incorrectly.
Example code:
df_a = pd.read_csv('afile.csv',sep=';',encoding='ANSI')
Empfänger is represented as: Empf„nger
Note: I have tried to convert the file to UTF-8 in Notepad++ and load it afterwards with pandas, but I still get the same error.
I have searched online for a solution, but the suggested fixes, such as "change the format in Notepad++ to UTF-8", use encoding='UTF-8', or use 'latin1' (which gives me the same result as the ANSI format), or
import chardet
with open('afile.csv', 'rb') as f:
    result = chardet.detect(f.readline())
df_a = pd.read_csv('afile.csv', sep=';', encoding=result['encoding'])
didn't work for me.
encoding='cp1252'
throws the following exception:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2: character maps to <undefined>
I also tried to replace strings afterwards with the .replace() method, but the character ü disappears completely once the data is loaded into a pandas DataFrame.
If you don't know your file's encoding, I think the fastest approach is to open the file in a text editor, like Notepad++, to check how your file is encoded.
Then go to the Python documentation and look up the correct codec to use.
In your case, ANSI, the codec is 'mbcs', so your code will look like this:
df_a = pd.read_csv('file.csv', sep=';', encoding='mbcs')
When EmpfängerStraße shows up as Empf„ngerStraáe when decoded as "ANSI" (or more correctly cp1250 in this case), then the actual encoding of the data is most likely cp850:
print 'Empf„ngerStraáe'.decode('utf8').encode('cp1250').decode('cp850')
Or in Python 3, where literal strings are already Unicode strings:
print("Empf„ngerStraáe".encode("cp1250").decode("cp850"))
I couldn't find a proper solution after trying all the well-known encodings, from ISO-8859-1 to ISO-8859-15, from UTF-8 to UTF-32, and from Windows-1250 through Windows-1258; nothing worked properly. So my guess is that the text encoding got corrupted during the export. My own solution was to load the text file into a DataFrame with Windows-1251, since it does not drop the special characters in my text file, and then replace all broken characters with the corresponding correct ones. It's a rather unsatisfying solution that takes a lot of time to compute, but it's better than nothing.
You could use the encoding value UTF-16LE to solve the problem
pd.read_csv("./file.csv", encoding="UTF-16LE")
The file.csv should first be saved with the UTF-16LE encoding in Notepad++ (the "UCS-2 LE BOM" option).
cp1252 works on both Linux and Windows to decode latin1-encoded files.
df = pd.read_csv('data.csv', sep=';', encoding='cp1252')
However, if you are running on a Windows machine, I would recommend using
df = pd.read_csv('data.csv', sep=';', encoding='mbcs')
Ironically, using 'latin1' as the encoding does not always work, especially if you want to convert the file to a different encoding.
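If conversion is the goal, here is a minimal sketch that re-encodes a file to UTF-8 (assuming the source is cp1252; filenames are placeholders):

# read with the source encoding, write back as UTF-8
with open('data.csv', encoding='cp1252') as src, \
     open('data_utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(src.read())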

Error reading csv file using pandas [duplicate]

This question already has answers here:
UnicodeDecodeError when reading CSV file in Pandas with Python
(25 answers)
Closed 5 years ago.
What I am trying to do: read a CSV to make a DataFrame, make changes to a column, write the changed values back into the same CSV (with to_csv), and then read that CSV again to make another DataFrame. On that second read I get an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte
my code is
import pandas as pd
df = pd.read_csv("D:\ss.csv")
df.columns #o/p is Index(['CUSTOMER_MAILID', 'False', 'True'], dtype='object')
df['True'] = df['True'] + 2 #making changes to one column of type float
df.to_csv("D:\ss.csv") #updating that .csv
df1 = pd.read_csv("D:\ss.csv") #again trying to read that csv
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte
So please suggest how I can avoid the error and read that CSV back into a DataFrame.
I know I am missing an "encoding=<some codec>" somewhere while reading from and writing to the CSV, but I don't know exactly what should be changed, so I need help.
Known encoding
If you know the encoding of the file you want to read in, you can use
pd.read_csv('filename.txt', encoding='encoding')
These are the possible encodings:
https://docs.python.org/3/library/codecs.html#standard-encodings
Unknown encoding
If you do not know the encoding, you can try chardet; however, this is not guaranteed to work. It is more of an educated guess.
import chardet
import pandas as pd
with open('filename.csv', 'rb') as f:
    result = chardet.detect(f.read())  # or f.readline() if the file is large
pd.read_csv('filename.csv', encoding=result['encoding'])
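chardet also reports a confidence score, which is worth checking before trusting the guess:

print(result)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
if result['confidence'] < 0.5:
    print('Detection is unreliable; inspect the file manually')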
Is that error happening on your first read of the data, or on the second read after you write it out and read it back in again? My guess is that it's actually happening on the first read of the data, because your CSV has an encoding that isn't UTF-8.
Try opening that CSV file in Notepad++, or Excel, or LibreOffice. Does your data source have the ç (C with cedilla) character in it? If it does, then that 0xE7 byte you're seeing is probably the ç encoded in either Latin-1 or Windows-1252 (called "cp1252" in Python).
Looking at the documentation for the Pandas read_csv() function, I see it has an encoding parameter, which should be the name of the encoding you expect that CSV file to be in. So try adding encoding="cp1252" to your read_csv() call, as follows:
df = pd.read_csv(r"D:\ss.csv", encoding="cp1252")
Note that I added the character r in front of the filename, so that it will be considered a "raw string" and backslashes won't be treated specially. That way you don't get a surprise when you change the filename from ss.csv to new-ss.csv, where the string D:\new-ss.csv would be read as D, :, newline character, e, w, etc.
Anyway, try that encoding parameter on your first read_csv() call and see if it works. (It's only a guess, since I don't know your actual data. If the data file isn't private and isn't too large, try posting the data file so we can see its contents -- that would let us do better than just guessing.)
One simple solution is to open the CSV file in an editor like Sublime Text and save it with 'utf-8' encoding. Then pandas can easily read the file.
The method above, importing chardet and detecting the file's encoding, works:
import chardet
import pandas as pd

with open('filename.csv', 'rb') as f:
    result = chardet.detect(f.read())  # or f.readline() if the file is large
pd.read_csv('filename.csv', encoding=result['encoding'])
Yes, you'll get this error. I worked around this problem by opening the CSV file in Notepad++ and changing the encoding through the Encoding menu -> Convert to UTF-8, then saving the file and running the Python program over it again.
Another solution is to use the codecs module in Python for encoding and decoding files. I haven't used that myself.
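A sketch of that codecs approach (untried, per the answer above; the cp1252 encoding is an assumption):

import codecs
import pandas as pd

# codecs.open returns a decoding file object that pandas can read from
with codecs.open(r'D:\ss.csv', 'r', encoding='cp1252') as f:
    df = pd.read_csv(f)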
I am new to Python. I ran into this exact issue when I manually changed the extension of my Excel file to .csv and tried to read it with read_csv. However, when I opened the Excel file and saved it as a CSV file instead, it worked.
