UnicodeDecodeError with nltk - python

I am working with Python 2.7 and nltk on a large .txt file of content scraped from various websites. However, I am getting various Unicode errors such as:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)
My question is not so much how I can 'fix' this in Python, but whether there is anything I can do to the .txt file itself (as in formatting), such as 'make plain text', before feeding it to Python, to avoid this issue entirely?
Update:
I looked around and found a solution within Python that seems to work perfectly:
import sys
reload(sys)                      # reload() is a builtin in Python 2 only
sys.setdefaultencoding('utf8')   # restores the setdefaultencoding hook that is hidden at interpreter startup

Try opening the file with:
f = open(fname, encoding="ascii", errors="surrogateescape")
Change "ascii" to the desired encoding. (This form of open is Python 3; in Python 2.7, io.open accepts the same encoding and errors arguments.)

Related

Python3 now throwing ASCII error loading data from JSON file

I am relatively new to Python and have spent at least two hours searching both the internet and now StackOverflow, and could not find out what the problem is here. My code was working in Python 2; now I get the persistent error message below. I even found this code online as an apparent answer to my question, claiming it worked in Python 3, but it does not for me. Odd.
import json, sys

with open('2689364.json', 'r', encoding='utf-8') as json_data:
    d = json.load(json_data)
print(d)
UnicodeEncodeError: 'ascii' codec can't encode character '\xb6' in position 1938: ordinal not in range(128)
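The traceback is an encode error rather than a decode error, so json.load itself has likely succeeded and the failure happens at print(), writing to a console configured for ASCII. A minimal sketch of one possible workaround (run the script with PYTHONIOENCODING=utf-8 set, or fall back to an ASCII-safe dump):
import json

with open('2689364.json', 'r', encoding='utf-8') as json_data:
    d = json.load(json_data)

try:
    print(d)
except UnicodeEncodeError:
    # escape non-ASCII characters so an ASCII-only console can still display the data
    print(json.dumps(d, ensure_ascii=True))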

UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 11597: ordinal not in range(128)

I am using the well-known 20 Newsgroups data set for text categorisation in Jupyter. When I try to open and read a file on my Mac, it fails at the decoding step. I tried to read the file in byte format, which works, but I further need to work with it as a string. I tried to decode it, but it fails with the error below.
Code
with open(file_path, 'rb') as f:
    file_read = f.read()
file_read.decode("us-ascii")
Error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 11597: ordinal not in range(128)
us-ascii is the file encoding reported by typing file -I file_name in the terminal. I tried other encodings, but none works.
I further want to remove punctuation and count the words in the file. Is there a way to overcome this issue?
It is tricky without looking at the file. However, this works most of the time:
from codecs import open

file_path = "file_name"
with open(file_path, 'rb') as f:
    file_read = f.read()   # raw bytes, since no encoding was passed; decode before doing string work
Setting errors to 'ignore' resolved the problem, thanks N M. The code looks like:
ref_file = open(ref_file_path, 'r', encoding='ascii', errors='ignore')
file_read = ref_file.read()
The code further treats it as one big string. Note that although the error was about decoding 0xff, the file was not UTF-16 encoded.
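For the follow-up step (removing punctuation and counting words), a minimal sketch in Python 3 syntax, assuming file_read is the decoded string from above:
import string
from collections import Counter

cleaned = file_read.translate(str.maketrans('', '', string.punctuation))
word_counts = Counter(cleaned.lower().split())
print(word_counts.most_common(10))   # the ten most frequent words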

Python encoding issue while reading a file

I am trying to read a file that contains the character "ë" in it. The problem is that I cannot figure out how to read it, no matter what I try to do with the encoding. When I manually look at the file in TextEdit, it is listed as an unknown 8-bit file. If I try changing it to utf-8, utf-16 or anything else, it either does not work or messes up the entire file. I tried reading the file with standard Python commands as well as with codecs, and cannot come up with anything that will read it correctly. I will include a code sample of the read below. Does anyone have any clue what I am doing wrong? This is Python 2.7.10, by the way.
readFile = codecs.open("FileName",encoding='utf-8')
The line I am trying to read is this with nothing else in it.
Aeëtes
Here are some of the errors I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x91 in position 0: invalid start byte
UnicodeError: UTF-16 stream does not start with BOM -- I know this one means it is not a UTF-16 file.
UnicodeDecodeError: 'ascii' codec can't decode byte 0x91 in position 0: ordinal not in range(128)
If I don't use a codec, the word comes in as Ae?tes, which then crashes the program later. Just to be clear, none of the suggested questions, or any others anywhere on the net, have pointed to an answer. One other detail that might help: I am using OS X, not Windows.
Credit for this answer goes to RadLexus for figuring out the proper encoding, and also to Mad Physicist, who pointed me in the right direction even if I did not consider all possible encodings.
The issue is apparently that the Mac saved the .txt file as mac_roman. If you use that encoding it will work perfectly.
This is the code that I used to read it:
import codecs
readFile = codecs.open("FileName", encoding='mac_roman')

Can't encode ascii character u'\xe9'

I am getting the message 'ascii' codec can't encode character u'\xe9' when writing a string to my file. Here's how I am writing the file:
my_file = open(output_path, "w")
my_file.write(output_string)
my_file.close()
I have been searching and found answers like this: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 7: ordinal not in range(128). The first one didn't work, and with this one I'm confused about why I am encoding data that I want to be able to read:
import io
f = io.open(filename, 'w', encoding='utf8')
Thanks for the help
As mentioned, you're trying to write non-ASCII characters with the ASCII encoding. Since the built-in open function in Python 2.7 doesn't support the encoding parameter, consider always using io.open (which is what the built-in open is in Python 3.x).
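A minimal end-to-end sketch of that suggestion, assuming output_string is (or has been decoded to) a unicode object; the path and example value below are illustrative, not from the question:
import io

output_path = 'output.txt'        # hypothetical path
output_string = u'caf\xe9\n'      # u'\xe9' is the character from the error message
with io.open(output_path, 'w', encoding='utf-8') as my_file:
    my_file.write(output_string)  # io.open in text mode expects unicode on Python 2.7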

Strange characters in console Python

Reading the word "beyoncè" from a text file, Python handles it as "beyonc\xc3\xa9".
If I write it into a file, it shows correctly, but in the console it shows like that.
Also, if I try to use it in my program, I get:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 17: ordinal not in range(128)
How can I get Python to read beyoncè from a text file correctly and get rid of this problem?
See if this helps:
import io
f = io.open('mytextfile.txt', 'w', encoding='utf-8')   # mode comes before keyword arguments; io.open accepts encoding on Python 2.7
f.write(line)
Try:
string = "beyonc\xc3\xa9"          # UTF-8 bytes (a Python 2 str)
text = string.decode("utf-8")      # decode to a unicode object for use inside the program
foo = open("foo.txt", "wb")
foo.write(string)                  # write the raw UTF-8 bytes back to disk
foo.close()
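Alternatively, a minimal sketch that reads the file straight into a unicode object with io.open, assuming the file really is UTF-8 encoded ('mytextfile.txt' is illustrative):
import io

with io.open('mytextfile.txt', encoding='utf-8') as f:
    line = f.read()
print(repr(line))                 # shows u'beyonc\xe9' rather than the raw bytes 'beyonc\xc3\xa9'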
