UnicodeDecodeError when reading CSV File into Dataframe

I am using the code below to read a CSV file into a dataframe. At first I got the error pandas.parser.CParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2, so I changed pd.read_csv('D:/TRYOUT.csv') to pd.read_csv('D:/TRYOUT.csv', error_bad_lines=False) as suggested here. However, I now get the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 1: invalid continuation byte on the same line.
import pandas as pd
def ExcelFileReader():
    mergedf = pd.read_csv('D:/TRYOUT.csv', error_bad_lines=False)
    return mergedf

If you're on Windows, you probably need to use pd.read_csv(filename, encoding='latin-1')

I had a similar problem and had to use
utf-8-sig
as the encoding.
I chose utf-8-sig rather than latin-1 because latin-1 cannot deal correctly with non-Latin characters if they ever appear in the data. There are a few ways around the problem, so pick the one that best suits your needs.
Hope that helps.
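To illustrate what utf-8-sig changes, here is a minimal sketch. The sample bytes are made up for the example; the assumption is that the file starts with a UTF-8 byte order mark (BOM), which some Windows tools write. Plain utf-8 keeps the BOM as an invisible character in the first column name, while utf-8-sig strips it.

```python
import codecs

# Hypothetical CSV content that starts with a UTF-8 byte order mark (BOM).
raw = codecs.BOM_UTF8 + "id,name\n1,Zoë\n".encode("utf-8")

with_bom = raw.decode("utf-8")         # the BOM survives as U+FEFF
without_bom = raw.decode("utf-8-sig")  # the BOM is stripped

# Compare the first header cell under both encodings.
first_header_plain = with_bom.splitlines()[0].split(",")[0]
first_header_sig = without_bom.splitlines()[0].split(",")[0]
print(repr(first_header_plain), repr(first_header_sig))
```

If the file has no BOM, utf-8-sig behaves like utf-8, so it is a safe choice for files that are UTF-8 either way.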

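If you would like to skip the rows that cause errors and ignore the malformed bytes, use:
pd.read_csv(file_path, encoding="utf8", on_bad_lines="skip", encoding_errors="ignore")
Both encoding_errors and on_bad_lines require pandas 1.3 or later. on_bad_lines="skip" replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in pandas 2.0.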

Related

'utf-8' codec can't decode byte 0xb8 in position 77: invalid start byte [duplicate]

I am new to Python and am trying to read a CSV file using the script below.
Past=pd.read_csv("C:/Users/Admin/Desktop/Python/Past.csv",encoding='utf-8')
But I get the error "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 35: invalid start byte". Please help me understand the issue here. I passed encoding in the script thinking it would resolve the error.

This happens because you chose the wrong encoding.
Since you are working on a Windows machine, just replacing
Past=pd.read_csv("C:/Users/.../Past.csv",encoding='utf-8')
with
Past=pd.read_csv("C:/Users/.../Past.csv",encoding='cp1252')
should solve the problem.
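A small sketch of why cp1252 helps with the 0x96 byte from the error message. The sample bytes are invented for illustration; only the 0x96 byte comes from the question.

```python
# Hypothetical bytes; 0x96 is the byte reported in the error message.
raw = b"2019 \x96 2020"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0x96 cannot start a UTF-8 sequence, hence "invalid start byte".
    utf8_ok = False

as_cp1252 = raw.decode("cp1252")   # 0x96 becomes an en dash (U+2013)
as_latin1 = raw.decode("latin-1")  # 0x96 becomes an invisible control character (U+0096)

print(utf8_ok, repr(as_cp1252), repr(as_latin1))
```

Note that latin-1 never raises, but for bytes in the 0x80 to 0x9f range it yields control characters rather than the punctuation that Windows programs usually meant.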

Try using:
pd.read_csv("Your filename", encoding="ISO-8859-1")
The data I scraped from a website was in this encoding rather than UTF-8, which is the standard default.

This solution strips out (ignores) the undecodable characters and returns the string without them. Use it only if you need to strip them, not convert them.
with open(path, encoding="utf8", errors='ignore') as f:
    contents = f.read()
With errors='ignore' you simply lose some characters. In my case I didn't care about them, since they seemed to be extra characters caused by bad formatting and programming in the clients connecting to my socket server, so this was an easy, direct solution.
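For comparison, a short sketch of what errors='ignore' actually does to the data, next to errors='replace' and decoding with the right encoding. The input bytes are a made-up example.

```python
# Hypothetical input: "café au lait" saved as latin-1 rather than UTF-8.
raw = b"caf\xe9 au lait"

ignored = raw.decode("utf-8", errors="ignore")    # silently drops the undecodable byte
replaced = raw.decode("utf-8", errors="replace")  # marks it with U+FFFD instead
correct = raw.decode("latin-1")                   # the right encoding keeps the character

print(ignored)
print(replaced)
print(correct)
```

With 'replace' you can at least see where data was lost, which makes it easier to work out the real encoding later.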

The following works very well for me:
encoding = 'latin1'

It's an old question, but it shows up when searching for solutions to this error, so I thought I'd answer for everyone who still stumbles on this thread.
You can check the file's encoding first and then pass the correct value to the encoding argument.
A simple way to find the encoding on Windows is to open the file in Notepad++ and look at the encoding shown there. The matching value for the encoding argument can then be found in the Python documentation.
See this question and its answers on Stack Overflow for more ways to detect a file's encoding.
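If no editor is at hand, a rough check can be done in Python. The sketch below is only a heuristic under stated assumptions: the function name guess_encoding is made up for this example, it only distinguishes a handful of common encodings, and a successful decode does not prove the guess is right.

```python
import codecs


def guess_encoding(raw: bytes) -> str:
    """Return a plausible encoding name for raw bytes (heuristic, not proof)."""
    # A byte order mark at the start is the strongest hint.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Valid UTF-8 is unlikely to occur by accident in 8-bit text.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # cp1252 leaves a few byte values undefined; latin-1 accepts every byte.
    try:
        raw.decode("cp1252")
        return "cp1252"
    except UnicodeDecodeError:
        return "latin-1"


print(guess_encoding("café".encode("utf-8")))
print(guess_encoding(b"caf\xe9"))
```

The returned name can then be passed as the encoding argument, for example to pd.read_csv or open(). Always inspect the decoded text, since several 8-bit encodings will decode the same bytes without error.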

The code below works for me:
with open(keeniz_dir + '/world_cities.csv', 'r', encoding='latin1') as infile:
    data = infile.read()

df = pd.read_csv( "/content/data.csv",encoding='latin1')
Just add ,encoding='latin1' and it will work

Don't pass the encoding option unless you are sure of the file's encoding. In pandas 1.2, the default encoding=None passes errors="replace" to the underlying open() call, so characters with encoding errors are substituted with replacement characters; you can then work out the correct encoding or simply use the resulting DataFrame. If an encoding is given, pandas passes errors="strict" to open(), and you get a UnicodeDecodeError (a subclass of ValueError) if that encoding is wrong. From pandas 1.3 onwards, error handling is controlled by the separate encoding_errors argument instead.

Pandas read_csv UnicodeDecodeError: invalid start byte

I am trying to read a .csv file using pandas but get this error.
Line of code:
pd.read_csv(r"C:\Users\antba\Desktop\ffstats.csv")
Error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 85: invalid start byte
I've removed the 'r' from the pd.read_csv command but was met with a different error message. Any help would be appreciated, thank you.

This is probably due to encoding. Also, please try Google before asking questions on Stack Overflow; it will help you learn more.
If you know the encoding of the CSV, try something like this:
pd.read_csv('your_file.csv', encoding = 'ISO-8859-1')

Encoding discrepancy with Iris Dataset

After I downloaded the dataset as iris.data, I renamed it to iris.data.txt. I was trying to circumvent this reported error on SO:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 8: invalid continuation byte
After reading up, I tried this:
dataset = pd.read_csv('iris.data.txt', header=None, names=names,encoding="ISO-8859-1")
This partly solved the error but some rows were still garbage.
Then I tried to open it with Sublime, save it with utf-8 encoding and then dataset = pd.read_csv('iris.data.txt', header=None, names=names,encoding="utf-8")
But this doesn't solve the problem either. I'm running Python 3 on Mac OS. What could possibly render the data readable directly?
[EDIT]:
The datatype reads: Web archive. In Spyder, the file appears as iris.data.webarchive
If I try dataset = pd.read_csv('iris.data.webarchive', header=None), it gives this traceback:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 5
If I try dataset = pd.read_csv('iris.data', header=None), it gives FileNotFoundError: File b'iris.data' does not exist

I figured out my rookie mistake. I had to save the page as 'source' instead of 'webarchive' (which is the default Mac setting)

'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte

I try to read and print the following file: txt.tsv (https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2017q3_notes.zip)
According to the SEC the data set is provided in a single encoding, as follows:
Tab Delimited Value (.txt): utf-8, tab-delimited, \n- terminated lines, with the first line containing the field names in lowercase.
My current code:
import csv
with open('txt.tsv') as tsvfile:
    reader = csv.DictReader(tsvfile, dialect='excel-tab')
    for row in reader:
        print(row)
All attempts ended with the following error message:
'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte
I am a bit lost. Can anyone help me?

Encoding in the file is 'windows-1252'. Use:
open('txt.tsv', encoding='windows-1252')

If someone works on Turkish data, then I suggest this line:
df = pd.read_csv("text.txt",encoding='windows-1254')

ds = pd.read_csv('/Dataset/test.csv', encoding='windows-1252')
Works fine for me, thanks.

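I had the same error message for a .csv file, and this worked for me (note that 'ANSI' is an alias for the Windows-only mbcs codec, so it will not work on other platforms):
df = pd.read_csv('Text.csv',encoding='ANSI')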

I encountered the same issue, and it worked once I used the latin1 encoding. Try the sample code below if the solutions above don't work.
df=pd.read_csv("../CSV_FILE.csv",na_values=missing, encoding='latin1')

If the input has a stray '\xa0', then it's not in UTF-8, full stop.
Yes, you have to either recode it to UTF-8 (see: iconv, recode commands, or a lot of text editors and IDEs can do it), or read it using an 8-bit encoding (as all the other answers suggest).
What you should ask yourself is - what is this character after all (0xa0 or 160)?
Well, in many 8-bit encodings it's a non-breaking space (like in HTML). For at least one DOS encoding it's an accented "a" character. That's why you need to look at the result of decoding it from the 8-bit encoding.
BTW, sometimes people say "UTF-8" when they mean "mostly ASCII, I guess". And if it was a non-breaking space, they weren't that far off:
In [1]: '\xa0'.encode()
Out[1]: b'\xc2\xa0'
One extra preceding '\xc2' byte would do the trick.
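To look at what the 0xa0 byte becomes under different 8-bit encodings, something like the following sketch can be used. The three encodings are picked as examples: two Windows/ISO ones and cp437 as the DOS code page.

```python
import unicodedata

raw = b"\xa0"  # the byte from the error message

# Decode the same byte with several 8-bit encodings and name the result.
decoded = {enc: raw.decode(enc) for enc in ("latin-1", "cp1252", "cp437")}
names = {enc: unicodedata.name(ch) for enc, ch in decoded.items()}

for enc, name in names.items():
    print(enc, "->", name)
```

This prints NO-BREAK SPACE for latin-1 and cp1252 and LATIN SMALL LETTER A WITH ACUTE for cp437, which is why you need to look at the surrounding decoded text to judge which encoding makes sense.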

Error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

https://github.com/affinelayer/pix2pix-tensorflow/tree/master/tools
An error occurred when running "process.py" from the above site.
python tools/process.py --input_dir data --operation resize --output_dir data2/resize
data/0.jpg -> data2/resize/0.png
Traceback (most recent call last):
  File "tools/process.py", line 235, in <module>
    main()
  File "tools/process.py", line 167, in main
    src = load(src_path)
  File "tools/process.py", line 113, in load
    contents = open(path).read()
  File "/home/user/anaconda3/envs/tensorflow_2/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
What is the cause of the error?
Python's version is 3.5.2.

Python tries to convert a byte array (a bytes object that it assumes to be a UTF-8-encoded string) to a Unicode string (str). This is a decoding according to UTF-8 rules. While doing so, it encounters a byte that is not allowed in UTF-8-encoded data (namely the 0xff at position 0).
Since you did not provide any code we could look at, we can only guess at the rest.
From the stack trace we can assume that the triggering action was reading from a file (contents = open(path).read()). I suggest rewriting it like this:
with open(path, 'rb') as f:
    contents = f.read()
The b in the mode specifier of open() states that the file is to be treated as binary, so contents will remain a bytes object. No decoding attempt will happen this way.
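A self-contained sketch of the difference between text mode and binary mode, assuming the input is a JPEG as in the question. JPEG files start with the bytes FF D8, which matches the 0xff at position 0 in the error; the temporary file here stands in for data/0.jpg.

```python
import os
import tempfile

# The first bytes of a JPEG file; 0xff can never start a UTF-8 sequence.
jpeg_header = b"\xff\xd8\xff\xe0"

# Write a stand-in file so the example runs anywhere.
with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
    tmp.write(jpeg_header)
    path = tmp.name

try:
    # Text mode tries to decode the bytes and fails.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Binary mode returns the raw bytes without any decoding.
    with open(path, "rb") as f:
        contents = f.read()
finally:
    os.remove(path)

print(text_mode_failed, contents)
```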

This solution strips out (ignores) the undecodable characters and returns the string without them. Use it only if you need to strip them, not convert them.
with open(path, encoding="utf8", errors='ignore') as f:
    contents = f.read()
With errors='ignore' you simply lose some characters. In my case I didn't care about them, since they seemed to be extra characters caused by bad formatting and programming in the clients connecting to my socket server.
So this was an easy, direct solution.

Use encoding format ISO-8859-1 to solve the issue.

I had an issue similar to this and ended up using UTF-16 to decode. My code is below.
with open(path_to_file, 'rb') as f:
    contents = f.read()
# Decode the bytes first; str methods such as rstrip("\n") do not work on bytes.
contents = contents.decode("utf-16").rstrip("\n")
contents = contents.split("\r\n")
This reads the file contents as raw bytes, decodes them from UTF-16, and then separates the text into lines.

I came across this thread when I hit the same error. After some research I can confirm that this error happens when you try to decode a UTF-16 file as UTF-8.
In UTF-16 the first character (2 bytes) is a byte order mark (BOM), which is used as a decoding hint and doesn't appear as a character in the decoded string. This means the first byte will be either FE or FF, and the second byte will be the other one.
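This can be reproduced in a few lines. The sample text is invented; the sketch only relies on Python's utf-16 codec writing a BOM when encoding.

```python
import codecs

# Encoding with "utf-16" prepends a byte order mark in the machine's byte order.
raw = "id,name\n".encode("utf-16")
bom = raw[:2]

# Decoding the same bytes as UTF-8 fails on the very first byte.
try:
    raw.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding as UTF-16 consumes the BOM and returns the original text.
text = raw.decode("utf-16")

print(bom, error_message, repr(text))
```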

It simply means that one chose the wrong encoding to read the file.
On Mac, use file -I file.txt to find the correct encoding. On Linux, use file -i file.txt.

I had a similar issue with PNG files and tried the solutions above without success.
This one worked for me in Python 3.8:
with open(path, "rb") as f:
    data = f.read()

use only
base64.b64decode(a)
instead of
base64.b64decode(a).decode('utf-8')
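A brief sketch of why the extra .decode('utf-8') fails when the base64 payload is binary. The PNG signature is used as sample data here; that the original payload was an image is an assumption.

```python
import base64

# The 8-byte PNG file signature, used here as sample binary data.
png_signature = b"\x89PNG\r\n\x1a\n"
a = base64.b64encode(png_signature)

# b64decode already returns bytes, which is what binary data should stay as.
data = base64.b64decode(a)

# Decoding those bytes as UTF-8 only makes sense if the payload is text.
try:
    data.decode("utf-8")
    decodes_as_text = True
except UnicodeDecodeError:
    decodes_as_text = False

print(data, decodes_as_text)
```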

This is caused by the encoding used when reading the file. By default, Python decodes text files with a platform-dependent encoding, so the same code may not work on every platform.
If 'utf-8' does not work, try a different encoding, for example:
with open(path, newline='', encoding='cp1252') as csvfile:
    reader = csv.reader(csvfile)
It should work once you change the encoding here. If the above doesn't work for you, other encodings are listed under standard-encodings.
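A runnable version of the same idea, using in-memory bytes in place of a file so it can be tried anywhere. The sample rows are made up; with a real file you would pass the path to open() as shown above.

```python
import csv
import io

# Hypothetical CSV bytes as a Windows program might write them (cp1252).
raw = "name,price\r\nCafé,€5\r\n".encode("cp1252")

# Wrap the bytes in a text layer that decodes with the chosen encoding.
with io.TextIOWrapper(io.BytesIO(raw), encoding="cp1252", newline="") as csvfile:
    rows = list(csv.reader(csvfile))

print(rows)
```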

Those getting similar errors while handling Pandas for data frames use the following solution.
example solution.
df = pd.read_csv("File path", encoding='cp1252')

I had this UnicodeDecodeError while trying to read a '.csv' file using pandas.read_csv(). In my case, I could not overcome the issue by trying other encodings. But instead of using
pd.read_csv(filename, delimiter=';')
I used:
pd.read_csv(open(filename, 'r'), delimiter=';')
which seems to work fine for me.
Note: in open(), use 'r' instead of 'rb'. With 'rb' you get bytes, which read_csv() then has to decode itself, and that is where the error occurred in the first place. With 'r' you get str, because open() decodes the file using the platform's default encoding. That default is not necessarily UTF-8 (on Windows it is often cp1252), which is probably why this worked when read_csv() on its own did not.

If you are receiving data from a serial port, make sure you are using the right baud rate (and other settings). Decoding as UTF-8 with the wrong port configuration will produce the same error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
To check your serial port configuration on Linux, use: stty -F /dev/ttyUSBX -a

I had a similar issue and searched all over the internet for a solution.
If you have this problem, copy your HTML code into a new HTML file and use the usual <meta charset="UTF-8">
and it will work.
Just create the new HTML file in the same location under a different name.

Check the path of the file being read. My code kept giving me errors until I changed the path to the present working directory. The error was:
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

If you are on a Mac, check whether there is a hidden .DS_Store file in the directory. After removing that file, my program worked.

I had a similar problem.
I solved it with:
import io
with io.open(filename, 'r', encoding='utf-8') as fn:
    lines = fn.readlines()
However, I then had another problem. Some HTML files (in my case) were not UTF-8, so I received a similar error. When I excluded those HTML files, everything worked smoothly.
So, apart from fixing the code, also check the files you are reading from; there may be a genuine incompatibility there.

You have to use latin1 as the encoding to read this file, as it contains some special characters.
The problem here is the encoding. When Python can't decode the data being read, it raises an error.
You can use latin1 or other encoding values.
I suggest trying and testing to find the right one for your dataset.

I had the same issue when processing a file generated on Linux. It turned out to be related to files containing question marks.

Following code worked in my case:
df = pd.read_csv(filename,sep = '\t', encoding='cp1252')

If possible, open the file in a text editor and change the encoding to UTF-8. Otherwise, do it programmatically at the OS level.
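One way to do the conversion programmatically, as a minimal sketch. The helper name reencode_to_utf8 and the file names are made up for this example, and it assumes you already know the source encoding (cp1252 here).

```python
import os
import tempfile


def reencode_to_utf8(src_path, dst_path, source_encoding):
    """Copy a text file, decoding with source_encoding and writing UTF-8."""
    # newline="" on both sides keeps the original line endings untouched.
    with open(src_path, encoding=source_encoding, newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Demonstration on a temporary file standing in for the real CSV.
with tempfile.TemporaryDirectory() as tmpdir:
    src = os.path.join(tmpdir, "input_cp1252.csv")
    dst = os.path.join(tmpdir, "output_utf8.csv")
    with open(src, "wb") as f:
        f.write("name\nJosé\n".encode("cp1252"))

    reencode_to_utf8(src, dst, "cp1252")

    with open(dst, "rb") as f:
        converted = f.read()

print(converted)
```

Writing to a separate destination file keeps the original intact in case the assumed source encoding turns out to be wrong.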
