Not able to read file due to unicode error in python - python

I'm trying to read a file and when I'm reading it, I'm getting a unicode error.
def reading_File(self,text):
url_text = "Text1.txt"
with open(url_text) as f:
content = f.read()
Error:
content = f.read()# Read the whole file
File "/home/soft/anaconda/lib/python3.6/encodings/ascii.py", line 26, in
decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 404:
ordinal not in range(128)
Why is this happening? I'm trying to run the same on Linux system, but on Windows it runs properly.

According to the question,
i'm trying to run the same on Linux system, but on Windows it runs properly.
Since we know from the question and some of the other answers that the file's contents are neither ASCII nor UTF-8, it's a reasonable guess that the file is encoded with one of the 8-bit encodings common on Windows.
As it happens 0x92 maps to the character 'RIGHT SINGLE QUOTATION MARK' in the cp125* encodings, used on US and latin/European regions.
So probably the the file should be opened like this:
# Python3
with open(url_text, encoding='cp1252') as f:
content = f.read()
# Python2
import codecs
with codecs.open(url_text, encoding='cp1252') as f:
content = f.read()

There can be two reasons for that to happen:
The file contains text encoded with an encoding different than 'ascii' and, according you your comments to other answers, 'utf-8'.
The file doesn't contain text at all, it is binary data.
In case 1 you need to figure out how the text was encoded and use that encoding to open the file:
open(url_text, encoding=your_encoding)
In case 2 you need to open the file in binary mode:
open(url_text, 'rb')

As it looks, default encoding is ascii while Python3 it's utf-8, below syntax to open the file can be used
open(file, encoding='utf-8')
Check your system default encoding,
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
If it's not UTF-8, reset the encoding of your system.
export LANGUAGE=en_US.UTF-8
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LC_TYPE=en_US.UTF-8

You can use codecs.open to fix this issue with the correct encoding:
import codecs
with codecs.open(filename, 'r', 'utf8' ) as ff:
content = ff.read()

Related

Can't read a .csv with python: "UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position..." [duplicate]

I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:
Traceback (most recent call last):
File "SCRIPT LOCATION", line NUMBER, in <module>
text = file.read()
File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to `<undefined>`
The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.
You specify the encoding when you open the file:
file = open(filename, encoding="utf8")
If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore"), if you want to remove unneeded characters. (docs)
Alternatively, if you don't need to decode the file, such as uploading the file to a website, use:
open(filename, 'rb')
where r = reading, b = binary
As an extension to #LennartRegebro's answer:
If you can't tell what encoding your file uses and the solution above does not work (it's not utf8) and you found yourself merely guessing - there are online tools that you could use to identify what encoding that is. They aren't perfect but usually work just fine. After you figure out the encoding you should be able to use solution above.
EDIT: (Copied from comment)
A quite popular text editor Sublime Text has a command to display encoding if it has been set...
Go to View -> Show Console (or Ctrl+`)
Type into field at the bottom view.encoding() and hope for the best (I was unable to get anything but Undefined but maybe you will have better luck...)
TLDR: Try: file = open(filename, encoding='cp437')
Why? When one uses:
file = open(filename)
text = file.read()
Python assumes the file uses the same codepage as current environment (cp1252 in case of the opening post) and tries to decode it to its own default UTF-8. If the file contains characters of values not defined in this codepage (like 0x90) we get UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding may be unhandled by Python (like e.g. cp790), sometimes the file can contain mixed encodings.
If such characters are unneeded, one may decide to replace them by question marks, with:
file = open(filename, errors='replace')
Another workaround is to use:
file = open(filename, errors='ignore')
The characters are then left intact, but other errors will be masked too.
A very good solution is to specify the encoding, yet not any encoding (like cp1252), but the one which has ALL characters defined (like cp437):
file = open(filename, encoding='cp437')
Codepage 437 is the original DOS encoding. All codes are defined, so there are no errors while reading the file, no errors are masked out, the characters are preserved (not quite left intact but still distinguishable).
Stop wasting your time, just add the following encoding="cp437" and errors='ignore' to your code in both read and write:
open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')
Godspeed
for me encoding with utf16 worked
file = open('filename.csv', encoding="utf16")
For those working in Anaconda in Windows, I had the same problem. Notepad++ help me to solve it.
Open the file in Notepad++. In the bottom right it will tell you the current file encoding.
In the top menu, next to "View" locate "Encoding". In "Encoding" go to "character sets" and there with patiente look for the enconding that you need. In my case the encoding "Windows-1252" was found under "Western European"
Before you apply the suggested solution, you can check what is the Unicode character that appeared in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at Unicode Consortium site http://www.unicode.org/charts/ by searching 0x0090)
and then consider removing it from the file.
def read_files(file_path):
with open(file_path, encoding='utf8') as f:
text = f.read()
return text
OR (AND)
def read_files(text, file_path):
with open(file_path, 'rb') as f:
f.write(text.encode('utf8', 'ignore'))
In the newer version of Python (starting with 3.7), you can add the interpreter option -Xutf8, which should fix your problem. If you use Pycharm, just got to Run > Edit configurations (in tab Configuration change value in field Interpreter options to -Xutf8).
Or, equivalently, you can just set the environmental variable PYTHONUTF8 to 1.
for me changing the Mysql character encoding the same as my code helped to sort out the solution. photo=open('pic3.png',encoding=latin1)

ERROR: UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 715: character maps to <undefined>

I am using Python Notebook to open this text file in Windows 10. However UTF-8 encoding isn't working. How should I solve this error?
Python is trying to open the file using your system's default encoding, but the bytes in the file cannot be decoded with that encoding.
You need to pass the correct encoding to open. We don't know what the correct encoding is, but the most likely candidates are UTF-8 or, less common these days, latin-1. So you would do something like
with open('myfile.txt', 'r', encoding='UTF-8') as f:
for line in f:
# do something with the line

Python - Reading CSV UnicodeError

I have exported a CSV from Kaggle - https://www.kaggle.com/ngyptr/python-nltk-sentiment-analysis. However, when I attempt to iterate through the file, I receive unicode errors concerning certain characters that cannot be encoded.
File "C:\Program Files\Python35\lib\encodings\cp850.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 264: character maps to
I have enabled utf-8 encoding while opening the file, which I assumed would have decoded the ASCII characters. Evidently not.
My Code:
with open("sentimentDataSet.csv", "r", encoding="utf-8" ,errors='ignore', newline='') as file:
reader = csv.reader(file)-
for row in reader:
if row:
print(row)
if row[sentimentCsvColumn] == sentimentScores(row[textCsvColumn]):
accuracyCount += 1
print(accuracyCount)
That's an encode error as you're printing the row, and has little to do with reading the actual CSV.
Your Windows terminal is in CP850 encoding, which can't represent everything.
There are some things you can do here.
A simple way is to set the PYTHONIOENCODING environment variable to a combination that will trash things it can't represent. set PYTHONIOENCODING=cp850:replace before running Python will have Python replace characters unrepresentable in CP850 with question marks.
Change your terminal encoding to UTF-8: chcp 65001 before running Python.
Encode the thing by hand before printing: print(str(data).encode('ascii', 'replace'))
Don't print the thing.

what encoding should I use to open a file with a big letter N with tilde character?

I'm trying to open a file with a big letter N tilde (http://graphemica.com/%C3%91) but I can't seem to figure it out. when I open the file in notepad++ it shows the character as xD1, when I open the file in gedit it shows \D1. When I open the file in excel, it shows the character correctly.
Now I'm trying to open the file in python, it halts when it encounters the character. I'm aware that I can put in the encoding so the file can be opened properly but I'm not sure which encoding I should use. My error is
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 0: invalid continuation byte
this is my code
with codecs.open('tsv.txt', 'r', 'utf8') as my_file:
for line in my_file:
print(line)
if it is not utf8, then what should I use? From the site above, it does not show which encoding 0xd1 is associated with.
You can find in tables how 'Ñ' gets encoded in different encodings.
You can also try it directly with Python:
>>> 'Ñ'.encode('utf8')
b'\xc3\x91'
>>> 'Ñ'.encode('latin1')
b'\xd1'
It seems that your file is encoded in latin-1.

UnicodeDecodeError while loading file in python

I'm running this:
news_train = load_mlcomp('20news-18828', 'train')
vectorizer = TfidfVectorizer(encoding='latin1')
X_train = vectorizer.fit_transform((open(f, errors='ignore').read()
for f in news_train.filenames))
but it got UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 39: invalid continuation byte. at open() function.
I checked the news_train.filenames. It is:
array(['/Users/juby/Downloads/mlcomp/379/train/sci.med/12836-58920',
..., '/Users/juby/Downloads/mlcomp/379/train/sci.space/14129-61228'],
dtype='<U74')
Paths look correct. It may be about dtype or my environment (I'm Mac OSX 10.11), but I can't fix it after I tried many times. Thank you!!!
p.s it's a ML tutorial from http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html#example-text-mlcomp-sparse-document-classification-py
Well I found the solution. Using
open(f, encoding = "latin1")
I'm not sure why it only happens on my mac though. Wish to know it.
Actually in Python 3+, the open function opens and reads file in default mode 'r' which will decode the file content (on most platform, in UTF-8). Since your files are encoded in latin1, decode them using UTF-8 could cause UnicodeDecodeError. The solution is either opening the files in binary mode ('rb'), or specify the correct encoding (encoding="latin1").
open(f, 'rb').read() # returns `byte` rather than `str`
# or,
open(f, encoding='latin1').read() # returns latin1 decoded `str`

Categories

Resources