NLTK Python word_tokenize [duplicate]

This question already has answers here:
How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte"
(20 answers)
Python (nltk) - UnicodeDecodeError: 'ascii' codec can't decode byte
(4 answers)
Closed 4 years ago.
I have loaded a txt file that contains 6000 lines of sentences. I tried to split("\n") and word_tokenize the sentences, but I get the following error:
Traceback (most recent call last):
File "final.py", line 52, in <module>
short_pos_words = word_tokenize(short_pos)
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 128, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 95, in sent_tokenize
return tokenizer.tokenize(text)
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1237, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1285, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1276, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1316, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 313, in _pair_iter
for el in it:
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1291, in _slices_from_text
if self.text_contains_sentbreak(context):
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1337, in text_contains_sentbreak
for t in self._annotate_tokens(self._tokenize_words(text)):
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1472, in _annotate_second_pass
for t1, t2 in _pair_iter(tokens):
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 312, in _pair_iter
prev = next(it)
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 581, in _annotate_first_pass
for aug_tok in tokens:
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 546, in _tokenize_words
for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)

The issue is the encoding of the file's content. Assuming that you want to decode the bytes to UTF-8 unicode:
Option 1 (Python 2 only - sys.setdefaultencoding was removed in Python 3, and this hack is discouraged even on Python 2):
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Option 2:
Pass the encoding parameter to the open function when opening your text file. Note that the built-in open only accepts encoding on Python 3; your traceback shows Python 2.7, where io.open does the same job (and works on Python 3 as well):
import io
f = io.open('/path/to/txt/file', 'r+', encoding="utf-8")
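Putting it together for the snippet in the question - a minimal sketch, assuming the file really is UTF-8 (the placeholder path is the one from Option 2):
import io
from nltk.tokenize import word_tokenize

# Read the file as unicode up front so word_tokenize never sees raw
# bytes - feeding it a byte str is what triggers the UnicodeDecodeError.
with io.open('/path/to/txt/file', 'r', encoding='utf-8') as f:
    short_pos = f.read()

sentences = short_pos.split('\n')           # split on real newlines
short_pos_words = word_tokenize(short_pos)  # tokenize the whole text
print(len(short_pos_words))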

Related

'utf-8' codec can't decode byte 0xe3 in position 87 word2vec gensim

I have this code:
import time
import multiprocessing
from datetime import timedelta
from gensim.models import word2vec
start_time = time.time()
print('Training Word2Vec Model...')
sentences = word2vec.LineSentence('data/data_text.txt')
id_w2v = word2vec.Word2Vec(sentences, size=300, workers=multiprocessing.cpu_count()-1)
id_w2v.save('model_terbaru/word2vec_300.model')
When I build the model, I get this error:
Traceback (most recent call last):
File"<ipython-input-10-fc7016864a34>", line 1, in <module>
runfile('F:/pa reza/model.py', wdir='F:/pa reza')
File "C:\ProgramData\Anaconda\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 704, in runfile
execfile(filename, namespace)
File "C:\ProgramData\Anaconda\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "F:/pa reza/model.py", line 13, in <module>
iter=10)
File "C:\ProgramData\Anaconda\lib\site-packages\gensim\models\word2vec.py", line 527, in init
fast_version=FAST_VERSION)
File "C:\ProgramData\Anaconda\lib\site-packages\gensim\models\base_any2vec.py", line 335, in __init__
self.build_vocab(sentences, trim_rule=trim_rule)
File "C:\ProgramData\Anaconda\lib\site-packages\gensim\models\base_any2vec.py", line 480, in build_vocab
sentences, progress_per=progress_per, trim_rule=trim_rule)
File "C:\ProgramData\Anaconda\lib\site-packages\gensim\models\word2vec.py", line 1151, in scan_vocab
for sentence_no, sentence in enumerate(sentences):
File "C:\ProgramData\Anaconda\lib\site-packages\gensim\models\word2vec.py", line 1073, in __iter__
line = utils.to_unicode(line).split()
File "C:\ProgramData\Anaconda\lib\site-packages\gensim\utils.py", line 359, in any2unicode
return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 87: invalid continuation byte
Help me, please.
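As in the accepted answer above, the byte 0xe3 suggests data/data_text.txt is not valid UTF-8, which is what gensim's utils.to_unicode expects. A minimal sketch of one workaround - re-encode the corpus to UTF-8 first (data_text_utf8.txt is a hypothetical output path; if you know the file's real codec, use it instead of latin-1):
import io
from gensim.models import word2vec

# latin-1 accepts every byte, so this read cannot fail; then write the
# text back out as the UTF-8 that LineSentence expects.
with io.open('data/data_text.txt', 'r', encoding='latin-1') as src:
    text = src.read()
with io.open('data/data_text_utf8.txt', 'w', encoding='utf-8') as dst:
    dst.write(text)

sentences = word2vec.LineSentence('data/data_text_utf8.txt')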

UnicodeDecodeError: 'utf8' codec can't decode byte 0xbb in position 5: invalid start byte

I am using Python 2.7 and have this error that I can't fix. I am trying to download HTML pages, and the next button looks like this: Next »
Traceback (most recent call last):
File "C:\Users\Said&Nour\Desktop\Documents\PythonFiles\LebanonParsing\Al Rifai\alrifai.py", line 109, in <module>
if PageP.find('a',attrs={'title':'Next »'}) is None:
File "C:\Python27\lib\site-packages\bs4\element.py", line 1300, in find
l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
File "C:\Python27\lib\site-packages\bs4\element.py", line 1321, in find_all
return self._find_all(name, attrs, text, limit, generator, **kwargs)
File "C:\Python27\lib\site-packages\bs4\element.py", line 602, in _find_all
strainer = SoupStrainer(name, attrs, text, **kwargs)
File "C:\Python27\lib\site-packages\bs4\element.py", line 1420, in __init__
normalized_attrs[key] = self._normalize_search_value(value)
File "C:\Python27\lib\site-packages\bs4\element.py", line 1434, in _normalize_search_value
return value.decode("utf8")
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbb in position 5: invalid start byte
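The last frames show bs4 calling value.decode("utf8") on the byte string 'Next »'; in a latin-1/cp1252-encoded source file, » is the single byte 0xbb, which is not valid UTF-8. A sketch of the usual Python 2 fix, not the asker's confirmed solution: save the script as UTF-8, declare the coding, and pass a unicode literal so nothing has to be decoded (the html variable is a stand-in for the downloaded page):
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = '<a title="Next »">Next »</a>'  # stand-in for the downloaded page
PageP = BeautifulSoup(html, 'html.parser')

# With a unicode literal, bs4 skips the value.decode("utf8") call that
# was blowing up on the raw 0xbb byte.
if PageP.find('a', attrs={'title': u'Next »'}) is None:
    print('no next page')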

Problems with utf-8 in python

My code is below.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pandas as pd
import codecs
df1 = pd.read_csv(r'E:\내논문자료\wordcloud\test1\1311_1312.csv',encoding='utf-8')
df2 = df1.groupby(['address']).size().reset_index()
df2.rename(columns = {0: 'frequency'}, inplace = True)
print(df2[:100])
But when I execute this code, I get this message:
Traceback (most recent call last):
File "E:/빅데이터 캠퍼스/untitled1/groupby freq.py", line 7, in <module>
df1 = pd.read_csv(r'E:\내논문자료\wordcloud\test1\1311_1312.csv',encoding='utf-8')
File "C:\Python34\lib\site-packages\pandas\io\parsers.py", line 645, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Python34\lib\site-packages\pandas\io\parsers.py", line 400, in _read
data = parser.read()
File "C:\Python34\lib\site-packages\pandas\io\parsers.py", line 938, in read
ret = self._engine.read(nrows)
File "C:\Python34\lib\site-packages\pandas\io\parsers.py", line 1507, in read
data = self._reader.read(nrows)
File "pandas\parser.pyx", line 846, in pandas.parser.TextReader.read (pandas\parser.c:10364)
File "pandas\parser.pyx", line 868, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:10640)
File "pandas\parser.pyx", line 945, in pandas.parser.TextReader._read_rows (pandas\parser.c:11677)
File "pandas\parser.pyx", line 1047, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:13111)
File "pandas\parser.pyx", line 1106, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:14065)
File "pandas\parser.pyx", line 1204, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:16121)
File "pandas\parser.pyx", line 1220, in pandas.parser.TextReader._string_convert (pandas\parser.c:16349)
File "pandas\parser.pyx", line 1452, in pandas.parser._string_box_utf8 (pandas\parser.c:22014)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 0: invalid start byte
How can I solve it? Should I alter the parser code in pandas?
It looks like your source data wasn't encoded with UTF-8 - it's likely one of the other codecs. Per this answer, you might want to try encoding='GBK' to start with, or encoding='gb2312'.
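A minimal sketch of that suggestion, looping over candidate codecs until one parses (the GBK/gb2312 guesses are from the answer above; cp949 is an extra guess of mine based on the Korean text in the path, not something from the question):
import pandas as pd

# Encoding guesses only - stop at the first codec that decodes cleanly.
for enc in ('GBK', 'gb2312', 'cp949'):
    try:
        df1 = pd.read_csv(r'E:\내논문자료\wordcloud\test1\1311_1312.csv', encoding=enc)
        print('parsed OK with ' + enc)
        break
    except UnicodeDecodeError:
        continue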

Python NLTK Word Tokenize UnicodeDecode Error

I get the error below when trying the following code. I am trying to read from a text file and tokenize the words using nltk. Any ideas? The text file can be found here
from nltk.tokenize import word_tokenize
short_pos = open("./positive.txt","r").read()
#short_pos = short_pos.decode('utf-8').lower()
short_pos_words = word_tokenize(short_pos)
Error:
Traceback (most recent call last):
File "sentimentAnalysis.py", line 19, in <module>
short_pos_words = word_tokenize(short_pos)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 106, in word_tokenize
return [token for sent in sent_tokenize(text, language)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 91, in sent_tokenize
return tokenizer.tokenize(text)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 311, in _pair_iter
for el in it:
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1280, in _slices_from_text
if self.text_contains_sentbreak(context):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1325, in text_contains_sentbreak
for t in self._annotate_tokens(self._tokenize_words(text)):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1460, in _annotate_second_pass
for t1, t2 in _pair_iter(tokens):
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
prev = next(it)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
for aug_tok in tokens:
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 6: ordinal not in range(128)
Thanks for your support.
Looks like this text is encoded in Latin-1. So this works for me:
from nltk.tokenize import word_tokenize
import codecs
with codecs.open("positive.txt", "r", "latin-1") as inputfile:
    text = inputfile.read()
short_pos_words = word_tokenize(text)
print len(short_pos_words)
You can test for different encodings by, e.g., looking at the file in a good editor like TextWrangler. You can
1) open the file in different encodings to see which one looks good, and
2) look at the character that caused the issue. In your case, that is the character at position 4645 - which happens to be part of an accented word from a Spanish review. It is not part of ASCII, and the byte sequence there is not valid UTF-8 either.
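If you would rather probe codecs programmatically than in an editor, a minimal sketch (the codec list is just a handful of common guesses):
import codecs

# Note: latin-1 maps every byte to a character, so it always "succeeds" -
# it only tells you the file can be read, not that the text looks right.
for enc in ('ascii', 'utf-8', 'latin-1', 'cp1252'):
    try:
        with codecs.open('positive.txt', 'r', enc) as f:
            f.read()
        print(enc + ': decodes cleanly')
    except UnicodeDecodeError as exc:
        print(enc + ': fails - ' + str(exc))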
Your file is encoded using "latin-1".
from nltk.tokenize import word_tokenize
import codecs
with codecs.open("positive.txt", "r", "latin-1") as inputfile:
    text = inputfile.read()
short_pos_words = word_tokenize(text)
print short_pos_words

'utf8' codec can't decode byte 0xdf in position 59: invalid continuation byte

I have the following string which I am trying to embed in json:
mystr = '<url host="bing.com" method="GET" uri="/update?v=1&uid=\xdf\xe2\x80|\xff\xff\xff\xff\xf3\x99\x83|\x88\xe2\x80|\xff\xff\xff\xff\xf6\x80|n\x80|&os=45" user_agent=""/>\n <url host="zaloopa.co.cc" method="GET" uri="/update?v=1&uid=\xdf\xe2\x80|\xff\xff\xff\xff\xf3\x99\x83|\x88\xe2\x80|\xff\xff\xff\xff\xf6\x80|n\x80|&os=45" user_agent=""/>'
import json
json.dumps({'url': mystr})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.6/json/__init__.py", line 230, in dumps
return _default_encoder.encode(obj)
File "/usr/lib64/python2.6/json/encoder.py", line 367, in encode
chunks = list(self.iterencode(o))
File "/usr/lib64/python2.6/json/encoder.py", line 309, in _iterencode
for chunk in self._iterencode_dict(o, markers):
File "/usr/lib64/python2.6/json/encoder.py", line 275, in _iterencode_dict
for chunk in self._iterencode(value, markers):
File "/usr/lib64/python2.6/json/encoder.py", line 294, in _iterencode
yield encoder(o)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xdf in position 59: invalid continuation byte
I tried mystr.encode('ascii', 'ignore'), but values such as \xdf are individual bytes outside the ASCII range, so that fails too. Is there a way I can massage the data and feed it into json without it crashing?
There is no hint of what the encoding of that line is.
Thanks...Amro
mystr = u'<url host="bing.com" method="GET" uri="/update?v=1&uid=\xdf\xe2\x80|\xff\xff\xff\xff\xf3\x99\x83|\x88\xe2\x80|\xff\xff\xff\xff\xf6\x80|n\x80|&os=45" user_agent=""/>\n <url host="zaloopa.co.cc" method="GET" uri="/update?v=1&uid=\xdf\xe2\x80|\xff\xff\xff\xff\xf3\x99\x83|\x88\xe2\x80|\xff\xff\xff\xff\xf6\x80|n\x80|&os=45" user_agent=""/>'
import json
print json.dumps({'url': mystr}, ensure_ascii=False)
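Why this works, as far as I can tell: with the u prefix, escapes like \xdf are already unicode codepoints (U+00DF), so json.dumps has nothing to decode - that prefix is the real fix, while ensure_ascii=False just keeps the characters literal instead of \u-escaping them. If the data arrives as a plain byte str rather than a literal, a hedged equivalent is to decode it with a codec that accepts every byte (the short raw_bytes below is a stand-in for the original string):
import json

raw_bytes = 'uid=\xdf\xe2\x80|'      # stand-in for the original byte string
mystr = raw_bytes.decode('latin-1')  # each byte 0x00-0xff becomes U+0000-U+00FF
print(json.dumps({'url': mystr}, ensure_ascii=False))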
