Python: How to deal with replacement character � [duplicate]

Here is my code,
for line in open('u.item'):
    # Read each line
Whenever I run this code it gives the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte
I tried to solve this and add an extra parameter in open(). The code looks like:
for line in open('u.item', encoding='utf-8'):
    # Read each line
But again it gives the same error. What should I do then?

As suggested by Mark Ransom, I found the right encoding for that problem. The encoding was "ISO-8859-1", so replacing open('u.item', encoding='utf-8') with open('u.item', encoding='ISO-8859-1') solves the problem.

The following also worked for me. ISO-8859-1 will save you a lot of trouble, especially if you are using Speech Recognition APIs.
Example:
file = open('../Resources/' + filename, 'r', encoding="ISO-8859-1")

Your file doesn't actually contain UTF-8 encoded data; it contains some other encoding. Figure out what that encoding is and use it in the open call.
In Windows-1252 encoding, for example, the 0xe9 would be the character é.
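As a quick illustration (the byte string is hypothetical), the same bytes decode cleanly under the single-byte codecs but fail under UTF-8:

```python
raw = b'caf\xe9s'  # Latin-1/Windows-1252 bytes for 'cafés'

# Both single-byte codecs map 0xE9 to 'é'
print(raw.decode('ISO-8859-1'))  # cafés
print(raw.decode('cp1252'))      # cafés

# UTF-8 rejects the same sequence: 0xE9 would start a multi-byte
# character, but the byte that follows is not a valid continuation
try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print(e.reason)  # invalid continuation byte
```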

Try this to read using Pandas:
pd.read_csv('u.item', sep='|', names=m_cols, encoding='latin-1')

This works:
open('filename', encoding='latin-1')
Or:
open('filename', encoding="ISO-8859-1")

If you are using Python 2, the following is the solution:
import io
for line in io.open("u.item", encoding="ISO-8859-1"):
    # Do something
The encoding parameter isn't supported by the built-in open() in Python 2, so without io.open you would get the following error:
TypeError: 'encoding' is an invalid keyword argument for this function

You could resolve the problem with:
for line in open(your_file_path, 'rb'):
'rb' reads the file in binary mode, so each line comes back as a bytes object and no decoding is attempted.

You can try this way:
open('u.item', encoding='utf8', errors='ignore')
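For comparison, errors='ignore' silently drops undecodable bytes, while errors='replace' substitutes U+FFFD, the � replacement character from the title; a small sketch on a hypothetical byte string:

```python
raw = b'caf\xe9s'  # Latin-1 bytes; 0xE9 is invalid here under UTF-8

print(raw.decode('utf-8', errors='ignore'))   # cafs  (bad byte dropped)
print(raw.decode('utf-8', errors='replace'))  # caf�s (bad byte becomes U+FFFD)
```

Both options hide the problem rather than fix it; if the data matters, find the real encoding instead.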

Based on another question on Stack Overflow and previous answers in this post, I would like to add some help in finding the right encoding.
If your script runs on a Linux OS, you can get the encoding with the file command:
file --mime-encoding <filename>
Here is a python script to do that for you:
import sys
import subprocess

if len(sys.argv) < 2:
    print("Usage: {} <filename>".format(sys.argv[0]))
    sys.exit(1)

def find_encoding(fname):
    """Find the encoding of a file using the file command."""
    # find the full path of the file command
    which_run = subprocess.run(['which', 'file'], stdout=subprocess.PIPE)
    if which_run.returncode != 0:
        print("Unable to find 'file' command ({})".format(which_run.returncode))
        return None
    file_cmd = which_run.stdout.decode().replace('\n', '')
    # run the file command to get the MIME encoding
    file_run = subprocess.run([file_cmd, '--mime-encoding', fname],
                              stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE)
    if file_run.returncode != 0:
        print(file_run.stderr.decode(), file=sys.stderr)
        return None
    # return the encoding name only
    return file_run.stdout.decode().split()[1]

# test
print("Encoding of {}: {}".format(sys.argv[1], find_encoding(sys.argv[1])))
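If the file command isn't available (e.g. on Windows), a brute-force, stdlib-only fallback is to try a short list of candidate encodings until one decodes the whole file. A sketch; the candidate list and its order are my assumption, so tune them for your data:

```python
def sniff_encoding(fname, candidates=('utf-8', 'cp1252', 'ISO-8859-1')):
    """Return the first candidate encoding that decodes the whole file, else None."""
    with open(fname, 'rb') as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None
```

Keep ISO-8859-1 last: it maps every possible byte, so it never fails, but it may silently produce the wrong characters.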

I was using a dataset downloaded from Kaggle while reading this dataset it threw this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 183: invalid continuation byte
So this is how I fixed it.
import pandas as pd
pd.read_csv('top50.csv', encoding='ISO-8859-1')

This is an example for converting a CSV file in Python 3:
try:
    inputReader = csv.reader(open(argv[1], encoding='ISO-8859-1'),
                             delimiter=',', quotechar='"')
except IOError:
    pass

Replacing the encoding with encoding='ISO-8859-1' works:
for line in open('u.item', encoding='ISO-8859-1'):
print(line)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7044: invalid continuation byte
The above error occurs due to the encoding.
Solution: use encoding='latin-1'.
Reference: https://pandas.pydata.org/docs/search.html?q=encoding

Sometimes open(filepath), where filepath actually isn't a file, gives the same error, so first make sure the file you're trying to open exists:
import os
assert os.path.isfile(filepath)

Open your file with Notepad++ and use the "Encoding" (or "Encodage") menu to identify the encoding, or to convert it from ANSI to UTF-8 or to the ISO 8859-1 code page.

To help this page turn up faster in Google searches for a similar question (about an error with UTF-8), I leave my solution here for others.
I had a problem opening a .csv file, with this description:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 150: invalid continuation byte
I opened the file with Notepad and counted to the 150th position: it was a Cyrillic character.
I re-saved the file with the 'Save as...' command with encoding 'UTF-8', and my program started to work.

I keep coming across this error, and often the problem is not resolved by encoding='utf-8' but in fact by engine='python', like this:
import pandas as pd
file = "c:\\path\\to_my\\file.csv"
df = pd.read_csv(file, engine='python')
df
A link to the docs is here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

Use this if you are directly loading data from GitHub or Kaggle:
DF = pd.read_csv(file, encoding='ISO-8859-1')

In my case, this issue occurred because I changed the extension of an Excel file (.xlsx) directly into .csv.
The solution was to open the file and then save it as a new .csv file (i.e., File -> Save As -> select the .csv extension and save). This worked for me.

My issue was similar in that UTF-8 text was getting passed to the Python script.
In my case, it was from SQL using the sp_execute_external_script in the Machine Learning service for SQL Server. For whatever reason, VARCHAR data appears to get passed as UTF-8, whereas NVARCHAR data gets passed as UTF-16.
Since there's no way to specify the default encoding in Python, and no user-editable Python statement parsing the data, I had to use the SQL CONVERT() function in my SELECT query in the @input_data_1 parameter.
So, while this query
EXEC sp_execute_external_script @language = N'Python',
@script = N'
OutputDataSet = InputDataSet
',
@input_data_1 = N'SELECT id, text FROM the_error;'
WITH RESULT SETS (([id] int, [text] nvarchar(max)));
gives the error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 0: unexpected end of data
Using CONVERT(type, data) (CAST(data AS type) would also work)
EXEC sp_execute_external_script @language = N'Python',
@script = N'
OutputDataSet = InputDataSet
',
@input_data_1 = N'SELECT id, CONVERT(NVARCHAR(max), text) FROM the_error;'
WITH RESULT SETS (([id] INT, [text] NVARCHAR(max)));
returns
id text
1 Ç

Related

Decode error with Pandas reading a .txt that only occurs on one computer

I have a comma-separated .txt file with French characters such as Vétérinaire and Désinfectant.
import pandas as pd
df = pd.read_csv('somefile.txt', sep=',', header=None, encoding='utf-8')
[Decode error - output not utf-8]
I have read many Q&A posts (including this one) and tried many different encodings such as 'latin1' and 'utf-16'; they didn't work. However, I ran the exact same script on a different Windows 10 computer with a similar Python setup (all Python 3.6), and it works perfectly fine on the other computer.
Edit: I tried this. Using encoding='cp1252' helps for some of the .txt files I want to import, but for a few .txt files, it gives the following error.
File "C:\Program_Files_Extra\Anaconda3\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 25: character maps to <undefined>
Edit:
Trying to identify the encoding with chardet:
import chardet
import pandas as pd
test_txt = 'somefile.txt'
rawdata = open(test_txt, 'rb').read()
result = chardet.detect(rawdata)
charenc = result['encoding']
print (charenc)
df = pd.read_csv(test_txt, sep=',', header=None, encoding=charenc)
print (df.head())
utf-8
[Decode error - output not utf-8]
Your program opens your files with a default encoding and that doesn't match the contents of the file you are trying to open.
Option 1: Decode the file contents to Python string objects:
rawdata = open(test_txt, 'r', encoding='UTF8').read()
(Note that binary mode can't be combined with encoding=; passing 'rb' together with an encoding raises a ValueError.)
Option 2: Open the csv file in an editor like Sublime Text and save it with utf-8 encoding to easily read the file through pandas.
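Option 2 can also be scripted: read the file with its actual encoding and write the text back out as UTF-8. A minimal sketch (the filenames are hypothetical, and it assumes the source really is cp1252):

```python
def convert_to_utf8(src, dst, src_encoding='cp1252'):
    """Re-encode a text file from src_encoding to UTF-8."""
    with open(src, 'r', encoding=src_encoding) as f:
        text = f.read()
    with open(dst, 'w', encoding='utf-8') as f:
        f.write(text)

# convert_to_utf8('somefile.txt', 'somefile_utf8.txt')
```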

Not able to read file due to unicode error in python

I'm trying to read a file and when I'm reading it, I'm getting a unicode error.
def reading_File(self, text):
    url_text = "Text1.txt"
    with open(url_text) as f:
        content = f.read()
Error:
content = f.read()  # Read the whole file
File "/home/soft/anaconda/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 404: ordinal not in range(128)
Why is this happening? I'm trying to run the same on Linux system, but on Windows it runs properly.
According to the question,
I'm trying to run the same on a Linux system, but on Windows it runs properly.
Since we know from the question and some of the other answers that the file's contents are neither ASCII nor UTF-8, it's a reasonable guess that the file is encoded with one of the 8-bit encodings common on Windows.
As it happens, 0x92 maps to the character 'RIGHT SINGLE QUOTATION MARK' in the cp125* encodings used in US and Latin/European regions.
So the file should probably be opened like this:
# Python 3
with open(url_text, encoding='cp1252') as f:
    content = f.read()

# Python 2
import codecs
with codecs.open(url_text, encoding='cp1252') as f:
    content = f.read()
There can be two reasons for that to happen:
The file contains text encoded with an encoding different from 'ascii' and, according to your comments on other answers, 'utf-8'.
The file doesn't contain text at all, it is binary data.
In case 1 you need to figure out how the text was encoded and use that encoding to open the file:
open(url_text, encoding=your_encoding)
In case 2 you need to open the file in binary mode:
open(url_text, 'rb')
It looks like the default encoding being used is ASCII, while in Python 3 it's UTF-8. The syntax below can be used to open the file:
open(file, encoding='utf-8')
Check your system default encoding,
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
If it's not UTF-8, reset the encoding of your system.
export LANGUAGE=en_US.UTF-8
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LC_CTYPE=en_US.UTF-8
You can use codecs.open to fix this issue with the correct encoding:
import codecs
with codecs.open(filename, 'r', 'utf8') as ff:
    content = ff.read()

Error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

https://github.com/affinelayer/pix2pix-tensorflow/tree/master/tools
An error occurred when running "process.py" from the above site.
python tools/process.py --input_dir data --operation resize --output_dir data2/resize
data/0.jpg -> data2/resize/0.png
Traceback (most recent call last):
  File "tools/process.py", line 235, in <module>
    main()
  File "tools/process.py", line 167, in main
    src = load(src_path)
  File "tools/process.py", line 113, in load
    contents = open(path).read()
  File "/home/user/anaconda3/envs/tensorflow_2/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
What is the cause of the error?
Python's version is 3.5.2.
Python tries to convert a byte array (a bytes object, which it assumes to be a UTF-8-encoded string) to a unicode string (str). This process is, of course, decoding according to UTF-8 rules. When it tries this, it encounters a byte sequence which is not allowed in UTF-8-encoded strings (namely this 0xff at position 0).
Since you did not provide any code we could look at, we can only guess at the rest.
From the stack trace we can assume that the triggering action was the reading from a file (contents = open(path).read()). I propose recoding this like so:
with open(path, 'rb') as f:
    contents = f.read()
That 'b' in the mode specifier in open() states that the file shall be treated as binary, so contents will remain a bytes object. No decoding attempt will happen this way.
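The error itself is easy to reproduce in isolation, since 0xFF can never start a valid UTF-8 sequence. The bytes below are a hypothetical example: they are UTF-16-LE for 'Hi' (with a BOM), which is a common source of this exact message:

```python
data = b'\xff\xfeH\x00i\x00'  # UTF-16-LE bytes for 'Hi', BOM first

try:
    data.decode('utf-8')
except UnicodeDecodeError as e:
    print(e.reason, e.start)   # invalid start byte 0

print(data.decode('utf-16'))   # Hi  (the same bytes are valid UTF-16)
```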
Use this solution if your need is to strip the offending characters out (ignore them) and return the string without them, not to convert them:
with open(path, encoding="utf8", errors='ignore') as f:
With errors='ignore' you'll just lose some characters. But if you don't care about them, as they seem to be extra characters originating from the bad formatting and programming of the clients connecting to my socket server, then it's an easy, direct solution.
Use encoding format ISO-8859-1 to solve the issue.
I had an issue similar to this and ended up using UTF-16 to decode. My code is below.
with open(path_to_file, 'rb') as f:
    contents = f.read()
contents = contents.decode("utf-16").rstrip("\n")
contents = contents.split("\r\n")
This reads the raw file contents as bytes, decodes them as UTF-16, and splits them into lines. (The decode has to come before rstrip("\n") on Python 3, since calling rstrip with a str argument on a bytes object raises a TypeError.)
I've come across this thread when suffering the same error. After doing some research, I can confirm this is an error that happens when you try to decode a UTF-16 file with UTF-8.
With UTF-16, the first character (2 bytes in UTF-16) is a Byte Order Mark (BOM), which is used as a decoding hint and doesn't appear as a character in the decoded string. This means the first byte will be either FE or FF, and the second the other.
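The stdlib exposes those marker bytes as constants, so you can sniff for a UTF-16 BOM before committing to a codec; a small sketch:

```python
import codecs

def utf16_codec_for(raw):
    """Return the matching UTF-16 codec name if raw starts with a BOM, else None."""
    if raw.startswith(codecs.BOM_UTF16_LE):  # b'\xff\xfe'
        return 'utf-16-le'
    if raw.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        return 'utf-16-be'
    return None
```

Decoding with plain 'utf-16' consumes the BOM automatically, so explicit sniffing like this is only needed when you have to distinguish UTF-16 from other encodings yourself.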
Heavily edited after I found out the real answer
It simply means that one chose the wrong encoding to read the file.
On Mac, use file -I file.txt to find the correct encoding. On Linux, use file -i file.txt.
I had a similar issue with PNG files, and I tried the solutions above without success.
This one worked for me in Python 3.8:
with open(path, "rb") as f:
Then use only
base64.b64decode(a)
instead of
base64.b64decode(a).decode('utf-8')
This is due to a different encoding method when reading the file. In Python, it by default decodes the data as UTF-8. However, that may not work on various platforms.
I propose an encoding method which can help you solve this if 'utf-8' does not work:
with open(path, newline='', encoding='cp1252') as csvfile:
    reader = csv.reader(csvfile)
It should work if you change the encoding method here. Also, you can find other encoding methods in standard-encodings if the above doesn't work for you.
If you are getting similar errors while handling data frames with Pandas, use the following solution. Example:
df = pd.read_csv("File path", encoding='cp1252')
I had this UnicodeDecodeError while trying to read a '.csv' file using pandas.read_csv(). In my case, I could not manage to overcome this issue using other encoding types. But instead of using
pd.read_csv(filename, delimiter=';')
I used:
pd.read_csv(open(filename, 'r'), delimiter=';')
which just seems to work fine for me.
Note: in the open() function, use 'r' instead of 'rb'. 'rb' returns a bytes object, which causes this decoder error to happen in the first place; that is the same problem as in read_csv(). But 'r' returns str, which is needed since our data is in .csv, and using the default encoding='utf-8' parameter, we can easily parse the data with the read_csv() function.
If you are receiving data from a serial port, make sure you are using the right baud rate (and the other configuration): decoding using utf-8 with the wrong configuration will generate the same error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
To check your serial port configuration on Linux, use: stty -F /dev/ttyUSBX -a
I had a similar issue and searched all over the internet for this problem.
If you have this problem, just copy your HTML code into a new HTML file and use the normal <meta charset="UTF-8">, and it will work.
Just create a new HTML file in the same location and use a different name.
Check the path of the file to be read. My code kept giving me errors until I changed the path name to the present working directory. The error was:
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
If you are on a Mac, check for a hidden file, .DS_Store. After removing that file my program worked.
I had a similar problem.
Solved it by:
import io
with io.open(filename, 'r', encoding='utf-8') as fn:
    lines = fn.readlines()
However, I had another problem. Some HTML files (in my case) were not utf-8, so I received a similar error. When I excluded those HTML files, everything worked smoothly.
So, apart from fixing the code, check also the files you are reading from; maybe there is an incompatibility there indeed.
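Excluding the bad files can be automated by catching UnicodeDecodeError per file; a sketch (the directory layout is hypothetical):

```python
import io
import os

def read_utf8_files(directory):
    """Return {filename: text} for the files that decode as UTF-8, skipping the rest."""
    texts = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        try:
            with io.open(path, 'r', encoding='utf-8') as fn:
                texts[name] = fn.read()
        except UnicodeDecodeError:
            pass  # not UTF-8; log or handle these files separately
    return texts
```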
You have to use latin1 as the encoding to read this file, as it contains some special characters; pass encoding='latin1' when opening it.
The problem here is the encoding type. When Python can't convert the data to be read, it gives an error.
You can use latin1 or other encoding values.
I say try and test to find the right one for your dataset.
I had the same issue when processing a file generated from Linux. It turned out it was related to files containing question marks.
The following code worked in my case:
df = pd.read_csv(filename,sep = '\t', encoding='cp1252')
If possible, open the file in a text editor and try to change the encoding to UTF-8. Otherwise do it programmatically at the OS level.

Error while importing csv in Python using pandas

I have started to learn Python for data science. I already use R on an almost daily basis. I'm stuck on the first step: I try to import a csv file using the Pandas read_csv method, but I have a problem with the encoding of the file while importing.
If I use read.csv from R everything is ok:
df <- read.csv2("some_path/myfile.txt", stringsAsFactors = FALSE, encoding = 'UTF-8')
but if I use similar code in python:
import pandas as pd
df = pd.read_csv("some_path/myfile.txt", sep = ';', encoding= 'utf8')
it returns an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc6 in position 13: invalid continuation byte
How is it possible that I can import a file with "utf-8" encoding in R, but not in Python?
If I use a different encoding (latin1 or iso-8859-1), it imports the file successfully, but the characters are not encoded in the right way.
Even though I don't understand why UTF-8 works in R but not in Python, I found out that the cp1250 encoding works fine.
Use encoding "UTF-16". I used that to resolve my issue with the same error.

Python: How do I read and parse a unicode utf-8 text file?

I am exporting UTF-8 text from Excel and I want to read and parse the incoming data using Python. I've read all the online info so I've already tried this, for example:
txtFile = codecs.open('halout.txt', 'r', 'utf-8')
for line in txtFile:
    print repr(line)
The error I am getting is:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: unexpected code byte
Looking at the text file in a hex editor, the first values are FF FE. I've also tried:
txtFile.seek(2)
right after the 'open', but that just causes a different error.
That file is not UTF-8; it's UTF-16LE with a byte-order marker.
That is a BOM.
EDIT: from the comments, it seems to be a UTF-16 BOM, so
codecs.open('foo.txt', 'r', 'utf-16')
should work.
Expanding on Johnathan's comment, this code should read the file correctly:
import codecs

txtFile = codecs.open('halout.txt', 'r', 'utf-16')
for line in txtFile:
    print repr(line)
Try to see if the Excel file has some blank rows (and then has values again); that might cause the unexpected error.
