Python (nltk) - UnicodeDecodeError: 'ascii' codec can't decode byte

I'm new to NLTK. I've searched around for encoding/decoding issues and the UnicodeDecodeError specifically, but this error seems specific to the NLTK source code.
Here's the error:
Traceback (most recent call last):
File "A:\Python\Projects\Test\main.py", line 2, in <module>
print(pos_tag(word_tokenize("John's big idea isn't all that bad.")))
File "A:\Python\Python\lib\site-packages\nltk\tag\__init__.py", line 100, in pos_tag
tagger = load(_POS_TAGGER)
File "A:\Python\Python\lib\site-packages\nltk\data.py", line 779, in load
resource_val = pickle.load(opened_resource)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)
How do I go about fixing this error?
Here's what causes the error:
from nltk import pos_tag, word_tokenize
print(pos_tag(word_tokenize("John's big idea isn't all that bad.")))

Try this (NLTK 3.0.1 with Python 2.7.x):
import io
f = io.open(txtFile, 'rU', encoding='utf-8')  # txtFile is the path to your text file
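A hypothetical follow-up, assuming f is the handle opened above: read the decoded text and hand it straight to the tokenizer and tagger.
from nltk import pos_tag, word_tokenize
text = f.read()  # io.open decodes to unicode on read
print(pos_tag(word_tokenize(text)))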

I had the same problem as you. I use Python 3.4 on Windows 7.
I had installed "nltk-3.0.0.win32.exe" (from here). But when I installed "nltk-3.0a4.win32.exe" (from here), my problem with nltk.pos_tag was solved. Give it a try.
EDIT: If the second link doesn't work, you can look here.
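If you're not sure which NLTK build your interpreter actually picked up after installing, a quick check:
import nltk
print(nltk.__version__)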

Duplicate: NLTK 3 POS_TAG throws UnicodeDecodeError
Long story short: NLTK 2.x isn't compatible with Python 3. You have to use NLTK 3, which was still somewhat experimental at that point.

Try using the module textclean:
pip install textclean
Python code:
from textclean.textclean import textclean
from nltk import pos_tag, word_tokenize
text = textclean.clean("John's big idea isn't all that bad.")
print pos_tag(word_tokenize(text))

Related

UnicodeError when importing python file

I am trying to import a Python file Sonderbuch_BASECASE_3ph.py into another Python file test.py. test.py is in the main directory foo, while Sonderbuch_BASECASE_3ph.py is in a subdirectory grid_data.
Sonderbuch_BASECASE_3ph.py has a function with the same name, which I need to import as well:
# Sonderbuch_BASECASE_3ph
from numpy import array
def Sonderbuch_BASECASE_3ph():
    .....
Both of these attempts to import result in a SyntaxError:
from grid_data import Sonderbuch_BASECASE_3ph
import grid_data.Sonderbuch_BASECASE_3ph
Output:
Traceback (most recent call last):
File "C:/Users/Artur/Desktop/foo/test.py", line 1, in <module>
from grid_data import Sonderbuch_BASECASE_3ph
File "C:\Users\Artur\Desktop\foo\grid_data\Sonderbuch_BASECASE_3ph.py", line 1550
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xe4 in position 29: invalid continuation byte
Edit:
The encoding of the file seems to be windows-1252; at least, that is what PyCharm is proposing. Decoding the file as windows-1252 does not resolve the error, though. Sonderbuch_BASECASE_3ph.py is just a storage file for a dictionary; I was hoping I could just import it.
None of the encodings seem to work.
What's in your Sonderbuch_BASECASE_3ph.py file exactly?
I guess that the files use different encodings, hence importing one into another may result in an error. My guess is that your test.py is in UTF-8 while the other file is encoded with latin-1 or something like that. Check the encoding of the files (you can do it in PyCharm, Sublime, Notepad++, etc.). In PyCharm, you can see the encoding of a file at the bottom right (by default).
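If the file really is windows-1252, one thing worth trying (a sketch, not a guaranteed fix) is a PEP 263 encoding declaration at the very top of Sonderbuch_BASECASE_3ph.py, so the interpreter stops assuming UTF-8:
# -*- coding: cp1252 -*-
from numpy import array
def Sonderbuch_BASECASE_3ph():
    .....
Note that the declaration must appear on the first or second line of the file.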

UnicodeDecodeError with nltk

I am working with Python 2.7 and nltk on a large txt file of content scraped from various websites. However, I am getting various unicode errors such as
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)
My question is not so much how I can 'fix' this in Python; rather, is there anything I can do to the .txt file itself (as in formatting, such as 'make plain text') before feeding it to Python, to avoid this issue entirely?
Update:
I looked around and found a solution within Python that seems to work perfectly:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Try opening the file with:
f = open(fname, encoding="ascii", errors="surrogateescape")
Replace "ascii" with the desired encoding. (Note that the encoding keyword on the built-in open is Python 3 only; on Python 2.7, use io.open instead.)
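A minimal Python 2 sketch along those lines, assuming the scraped file (here a hypothetical scraped.txt) is mostly UTF-8: decode on read and replace any undecodable bytes, so downstream nltk calls receive clean unicode.
import io
with io.open('scraped.txt', encoding='utf-8', errors='replace') as f:
    text = f.read()  # unicode, with bad bytes replaced by U+FFFD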

UnicodeDecodeError even when importing simple txt file in Python

I would like to read even a simple text file into Python. For example, here are the contents of example.txt:
hello
my
friend
Very simple. However, when I try to import the file and read it:
f = open('example.txt')
f.read()
I get the following error:
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
f.read()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
What's the source of this problem? Clearly there aren't any non-ASCII characters in the file.
I've tried this in IDLE, terminal (Mac OSX) and Rodeo and get similar issues in all.
I'm very new to Python and am concerned I may have screwed up something in installation. I've downloaded various versions over the years, straight from python.org, Anaconda, MacPorts, etc., and I'm wondering if the various sources are not playing nicely...
Python 3.5.1 on OSX 10.11.4.
Maybe your file is saved with a byte order mark (BOM): 0xff at position 0 is how a UTF-16 BOM (0xFF 0xFE) starts. Try saving your file explicitly as UTF-8 (without BOM). Since a BOM is not valid ASCII, it causes a UnicodeDecodeError.
Hope this helps!
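A quick way to test the BOM theory (a sketch: 'utf-8-sig' strips a UTF-8 BOM if one is present, while a file starting with 0xFF 0xFE would need 'utf-16' instead):
f = open('example.txt', encoding='utf-8-sig')  # or encoding='utf-16'
print(f.read())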

numpy loadtxt, unicode, and python 2 or 3

I have a (Windows) text file reported by Linux as being:
ISO-8859 text, with very long lines, with CRLF line terminators
I want to read this into numpy, except for the first line, which contains labels (with special characters, usually only the Greek mu).
Python 2.7.6, Numpy 1.8.0, this works perfectly:
data = np.loadtxt('input_file.txt', skiprows=1)
Python 3.4.0, Numpy 1.8.0, gives an error:
>>> np.loadtxt('input_file.txt', skiprows=1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/site-packages/numpy/lib/npyio.py", line 796, in loadtxt
next(fh)
File "/usr/lib/python3.4/codecs.py", line 313, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 4158: invalid start byte
To me this is "buggy" behaviour for the following reasons:
I want to skip the first line so it should be ignored, regardless of its encoding
If I delete the first line from the file, loadtxt works fine in both versions of python
Shouldn't numpy.loadtxt behave the same in python2 and python3?
Questions:
How do I get around this problem (using Python 3, of course)?
Should I file a bug report or is this expected behaviour?
It seems like a bug in loadtxt(); try using genfromtxt() instead.
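A minimal sketch of that suggestion: genfromtxt has historically read the file as bytes, so the ISO-8859 header never goes through the UTF-8 codec, and skip_header drops the label row.
import numpy as np
data = np.genfromtxt('input_file.txt', skip_header=1)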
Yes, it seems to be a bug in NumPy: it tries to do some parsing even in skipped rows and fails. Better to report it.
In the meantime, the documentation says that loadtxt accepts a file object or a generator of strings as its first argument. Try this:
import numpy as np
f = open('input_file.txt', encoding='latin-1')  # match the ISO-8859 encoding reported above
f.readline()  # skip the label row yourself
data = np.loadtxt(f)
P.S. The error 'utf-8' codec can't decode byte 0xb5 in position 4158 doesn't seem to happen at the beginning of the file. Are you sure your file doesn't contain some weird symbol that is invisible, or looks like a space but actually is not?

Can't use NLTK on text pulled from the Silmarillion

I'm trying to use Tolkien's Silmarillion as a practice text for learning some NLP with nltk.
I am having trouble getting started because I'm running into text encoding issues.
I'm using the TextBlob wrapper around NLTK (https://github.com/sloria/TextBlob) because it's a lot easier.
The sentence that I can't parse is:
"But Húrin did not answer, and they sat beside the stone, and did not speak again".
I believe it's the special character in Hurin causing the issue.
My code:
from text.blob import TextBlob
b = TextBlob( 'But Húrin did not answer, and they sat beside the stone, and did not speak again' )
b.noun_phrases
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
As this is just a for-fun project, I just want to be able to use this text, extract some attributes, and do some basic processing.
How can I convert this text to ASCII when I don't know what the initial encoding is? I tried to decode from UTF-8, then re-encode into ASCII:
>>> asc = unicode_text.decode('utf-8')
>>> asc = unicode_text.encode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 10: ordinal not in range(128)
But even that doesn't work. Any suggestions are appreciated; I'm fine with losing the special characters, as long as it's done consistently across the document.
I'm using python 2.6.8 with the required modules also correctly installed.
First, update TextBlob to the latest version (0.6.0 as of this writing), as there have been some unicode fixes in recent updates. This can be done with:
$ pip install -U textblob
Then, use a unicode literal, like so:
from text.blob import TextBlob
b = TextBlob( u'But Húrin did not answer, and they sat beside the stone, and did not speak again' )
noun_phrases = b.noun_phrases
print noun_phrases
# WordList([u'h\xfarin'])
print noun_phrases[0]
# húrin
This is verified on Python 2.7.5 with TextBlob 0.6.0, but it should work with Python 2.6.8 as well.
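Since you said you're fine with losing the special characters, here is a hedged standard-library alternative (Python 2): NFKD-decompose the unicode text and drop anything that won't encode as ASCII, so u'Húrin' becomes 'Hurin'.
# -*- coding: utf-8 -*-
import unicodedata
text = u'But Húrin did not answer, and they sat beside the stone'
ascii_text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
print ascii_text  # But Hurin did not answer, and they sat beside the stone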
