How is it possible that with the same input I sometimes get an ascii codec error, and sometimes it works just fine? The code cleans the name and builds its Soundex and DMetaphone values. It works in roughly 1 out of 5 runs, sometimes more often :)
UPD: Looks like it's an issue with fuzzy.DMetaphone, at least on Python 2.7 with Unicode. I plan to integrate Metaphone instead for now. Any solutions for the fuzzy.DMetaphone problem are very welcome :)
UPD 2: The problem is gone after updating fuzzy to 1.2.2. The same code works fine.
import re
import fuzzy
import sys

def make_namecard(full_name):
    soundex = fuzzy.Soundex(4)
    dmeta = fuzzy.DMetaphone(4)
    names = process_name(full_name)
    print names
    soundexes = map(soundex, names)
    dmetas = []
    for name in names:
        print name
        dmetas.extend(list(dmeta(name)))
    dmetas = filter(bool, dmetas)
    return {
        "full_name": full_name,
        "soundex": soundexes,
        "dmeta": dmetas,
        "names": names,
    }

def process_name(full_name):
    full_name = re.sub("[_-]", " ", full_name)
    full_name = re.sub(r'[^A-Za-z0-9 ]', "", full_name)
    names = full_name.split()
    names = filter(valid_name, names)
    return names

def valid_name(name):
    COMMON_WORDS = ["the", "of"]
    return len(name) >= 2 and name.lower() not in COMMON_WORDS

print make_namecard('Jerusalem Warriors')
Output:
➜ python2.7 make_namecard.py
['Jerusalem', 'Warriors']
Jerusalem
Warriors
{'soundex': [u'J624', u'W624'], 'dmeta': [u'\x00\x00\x00\x00', u'ARSL', u'ARRS', u'FRRS'], 'full_name': 'Jerusalem Warriors', 'names': ['Jerusalem', 'Warriors']}
➜ python2.7 make_namecard.py
['Jerusalem', 'Warriors']
Jerusalem
Traceback (most recent call last):
File "make_namecard.py", line 38, in <module>
print make_namecard('Jerusalem Warriors')
File "make_namecard.py", line 16, in make_namecard
dmetas.extend(list(dmeta(name)))
File "src/fuzzy.pyx", line 258, in fuzzy.DMetaphone.__call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 0: ordinal not in range(128)
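For the Metaphone fallback mentioned in the update, a minimal sketch using the third-party Metaphone package (pip install Metaphone); the package choice is my assumption, and the codes it emits differ from fuzzy's four-character ones:

from metaphone import doublemetaphone

def dmeta_codes(name):
    # doublemetaphone() returns a (primary, secondary) tuple, with '' when
    # there is no secondary code; drop the empty ones, mirroring the
    # filter(bool, dmetas) step above.
    return [code for code in doublemetaphone(name) if code]

print dmeta_codes('Jerusalem')
print dmeta_codes('Warriors')

Being pure Python, it avoids the nondeterministic failures seen here with fuzzy before 1.2.2.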
When I try to return a value from C code to Python code I get an error.
Traceback (most recent call last):
File "python.py", line 54, in <module>
print("\n\n\n\RESULT: ", str(result, "utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 245: invalid start byte
In the C function I try to return a JSON string containing hex data, which I then parse in Python for further calculation.
An example of the returned string is "{"data":"0x123132"}"
In Python I use:
import ctypes
my_functions = ctypes.cdll.LoadLibrary("./my_functions.so")
my_functions.getJson.argtypes = (ctypes.c_char_p,)
my_functions.EthereumProcessor.restype = ctypes.c_char_p
result=my_functions.getJson()
print("\n\n\n\RESULT: ", str(result, "utf-8"))
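Two things stand out as likely causes (assumptions, since the C side isn't shown): restype is declared on EthereumProcessor while the call goes through getJson, and a byte like 0xba at position 245 of a short JSON string suggests ctypes is reading past the real data, e.g. a buffer that isn't NUL-terminated or that was freed on the C side. A minimal sketch of the Python-side binding, assuming the C function returns a persistent NUL-terminated UTF-8 string:

import ctypes

my_functions = ctypes.cdll.LoadLibrary("./my_functions.so")

# Declare the return type on the function actually being called, so ctypes
# converts the returned char* to a Python bytes object instead of an int.
my_functions.getJson.restype = ctypes.c_char_p
my_functions.getJson.argtypes = ()  # the call above passes no arguments

result = my_functions.getJson()  # bytes, e.g. '{"data":"0x123132"}'
print("RESULT:", result.decode("utf-8"))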
When I feed a non-English string into the YouTube API library's search, it only works for the initial search; if I call list_next(), it throws a UnicodeEncodeError. When I use a simple ASCII string, everything works correctly.
Any suggestions about what I should do?
Here's simplified code showing what I'm doing:
# -*- coding: utf-8 -*-
import apiclient.discovery

def test(query):
    youtube = apiclient.discovery.build('youtube', 'v3', developerKey='xxx')
    ys = youtube.search()
    req = ys.list(
        q=query.encode('utf-8'),
        type='video',
        part='id,snippet',
        maxResults=50
    )
    while req:
        res = req.execute()
        for i in res['items']:
            print(i['id']['videoId'])
        req = ys.list_next(req, res)

test(u'한글')
test(u'日本語')
test(u'\uD55C\uAE00')
test(u'\u65E5\u672C\u8A9E')
Error message:
Traceback (most recent call last):
File "E:\prj\scripts\yt\search.py", line 316, in _search
req = ys.list_next(req, res)
File "D:\Apps\Python\lib\site-packages\googleapiclient\discovery.py", line 966, in methodNext
parsed[4] = urlencode(newq)
File "D:\Apps\Python\lib\urllib.py", line 1343, in urlencode
v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
Versions:
google-api-python-client (1.6.2)
Python 2.7.13 (Win32)
EDIT: I posted a workaround below.
If anyone else is interested, here's one workaround that works for me:
googleapiclient/discovery.py:
(old) q = parse_qsl(parsed[4])
(new) q = parse_qsl(parsed[4].encode('ascii'))
Explanation
In discovery.py, list_next() parses and unescapes the previous URL, then builds a new URL from it:
pageToken = previous_response['nextPageToken']
parsed = list(urlparse(request.uri))
q = parse_qsl(parsed[4])
# Find and remove old 'pageToken' value from URI
newq = [(key, value) for (key, value) in q if key != 'pageToken']
newq.append(('pageToken', pageToken))
parsed[4] = urlencode(newq)
uri = urlunparse(parsed)
It seems the problem is that when parse_qsl unescapes the unicode parsed[4], it returns the UTF-8 encoded value as a unicode type, which urlencode does not like:
>>> q = urlparse.parse_qsl(u'q=%ED%95%9C%EA%B8%80')
>>> q
[(u'q', u'\xed\x95\x9c\xea\xb8\x80')]
>>> urllib.urlencode(q)
UnicodeEncodeError
If parse_qsl is given a plain ASCII bytestring instead, it returns plain UTF-8 encoded bytestrings, which urlencode handles fine:
>>> q = urlparse.parse_qsl(u'q=%ED%95%9C%EA%B8%80'.encode('ascii'))
>>> q
[('q', '\xed\x95\x9c\xea\xb8\x80')]
>>> urllib.urlencode(q)
'q=%ED%95%9C%EA%B8%80'
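If you'd rather not edit the installed library, a sketch of an alternative, assuming request.uri is pure-ASCII unicode (the client percent-encodes query values), is to re-encode the URI on the request before each list_next() call:

while req:
    res = req.execute()
    for i in res['items']:
        print(i['id']['videoId'])
    # Hand list_next() a bytestring URI so parse_qsl returns bytestrings.
    req.uri = req.uri.encode('ascii')
    req = ys.list_next(req, res)

This is only reasoned from the discovery.py snippet above, so treat it as untested.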
I'm getting a Unicode error when running this test script.
import urllib
import json

movieTitle = "Bee Movie"
title = movieTitle.replace(" ", "+")
year = ""
omdbAPI = "http://www.omdbapi.com/?t={}&y={}&plot=short&r=json".format(
    title, year)
print(omdbAPI)
response = urllib.urlopen(omdbAPI)
data = json.loads(response.read())
valid_data = data["Response"]
print("This data is: " + valid_data)
if valid_data == "True":
    print data["Title"]
    print data["Year"]
    print data["Plot"]
    print data["Rated"]
    print data["Released"]
    print data["Runtime"]
    print data["Genre"]
    print data["Director"]
    print data["Writer"]
    print data["Actors"]
    print data["Language"]
    print data["Country"]
    print data["Awards"]
    print data["Poster"]
    print data["Metascore"]
    print data["imdbRating"]
    print data["imdbVotes"]
    print data["imdbID"]
    print data["Type"]
    print data["Response"]
elif valid_data == "False":
    print("This data is: " + valid_data)
else:
    raise ValueError("The information was not found")
Error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 19: ordinal not in range(128)
I guess it's because one of the actors seems to have a é character.
I figured out that I could put .encode('utf8') after print data["Actors"], but that doesn't seem like the smartest thing to do. I mean, a non-ASCII letter could occur in more places than the Actors field, and it seems odd to put .encode('utf8') after every instance.
UPDATE:
Traceback (most recent call last):
File "/Volumes/postergren_projectDrive/Projekt/programmingSandbox/python/courses/udacity/Programming Foundations with Python/moveis/Advance/media.py", line 25, in <module>
print data["Actors"]
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 19: ordinal not in range(128)
[Finished in 0.1s with exit code 1]
[shell_cmd: "python" -u "/Volumes/postergren_projectDrive/Projekt/programmingSandbox/python/courses/udacity/Programming Foundations with Python/moveis/Advance/media.py"]
[dir: /Volumes/postergren_projectDrive/Projekt/programmingSandbox/python/courses/udacity/Programming Foundations with Python/moveis/Advance]
[path: /usr/bin:/bin:/usr/sbin:/sbin]
Try this at the beginning of your code:
import sys
reload(sys)  # reload() restores setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding('utf-8')
You can do this:
for key in data.keys():
    data[key] = data[key].encode('utf8')
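A slightly more defensive variant (a sketch; it assumes the values you print are all strings, as they are in the OMDb response) only re-encodes unicode values:

for key, value in data.items():
    # Only unicode values need encoding; leave anything else untouched.
    if isinstance(value, unicode):
        data[key] = value.encode('utf-8')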
I am playing around with NLTK for an assignment on sentiment analysis. I am using Python 2.7, NLTK 3.0, and NumPy 1.9.1.
This is the code:
__author__ = 'karan'

import nltk
import re
import sys

def main():
    print("Start");
    # getting the stop words
    stopWords = open("english.txt","r");
    stop_word = stopWords.read().split();
    AllStopWrd = []
    for wd in stop_word:
        AllStopWrd.append(wd);
    print("stop words-> ",AllStopWrd);
    # sample and also cleaning it
    tweet1= 'Love, my new toyí ½í¸í ½í¸#iPhone6. Its good https://twitter.com/Sandra_Ortega/status/513807261769424897/photo/1'
    print("old tweet-> ",tweet1)
    tweet1 = tweet1.lower()
    tweet1 = ' '.join(re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",tweet1).split())
    print(tweet1);
    tw = tweet1.split()
    print(tw)
    # tokenize
    sentences = nltk.word_tokenize(tweet1)
    print("tokenized ->", sentences)
    # remove stop words
    Otweet = []
    for w in tw:
        if w not in AllStopWrd:
            Otweet.append(w);
    print("sans stop word-> ",Otweet)
    # get taggers for neg/pos/inc/dec/inv words
    taggers = {}
    negWords = open("neg.txt","r");
    neg_word = negWords.read().split();
    print("neg words-> ",neg_word)
    posWords = open("pos.txt","r");
    pos_word = posWords.read().split();
    print("pos words-> ",pos_word)
    incrWords = open("incr.txt","r");
    inc_word = incrWords.read().split();
    print("incr words-> ",inc_word)
    decrWords = open("decr.txt","r");
    dec_word = decrWords.read().split();
    print("dec words-> ",dec_word)
    invWords = open("inverse.txt","r");
    inv_word = invWords.read().split();
    print("inverse words-> ",inv_word)
    for nw in neg_word:
        taggers.update({nw:'negative'});
    for pw in pos_word:
        taggers.update({pw:'positive'});
    for iw in inc_word:
        taggers.update({iw:'inc'});
    for dw in dec_word:
        taggers.update({dw:'dec'});
    for ivw in inv_word:
        taggers.update({ivw:'inv'});
    print("tagger-> ",taggers)
    print(taggers.get('little'))
    # get parts of speech
    posTagger = [nltk.pos_tag(tw)]
    print("posTagger-> ",posTagger)

main();
This is the error that I am getting when running my code:
SyntaxError: Non-ASCII character '\xc3' in file C:/Users/karan/PycharmProjects/mainProject/sentiment.py on line 19, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
How do I fix this error?
I also tried the code with Python 3.4.2 (with NLTK 3.0 and NumPy 1.9.1), but then I get this error:
Traceback (most recent call last):
File "C:/Users/karan/PycharmProjects/mainProject/sentiment.py", line 80, in <module>
main();
File "C:/Users/karan/PycharmProjects/mainProject/sentiment.py", line 72, in main
posTagger = [nltk.pos_tag(tw)]
File "C:\Python34\lib\site-packages\nltk\tag\__init__.py", line 100, in pos_tag
tagger = load(_POS_TAGGER)
File "C:\Python34\lib\site-packages\nltk\data.py", line 779, in load
resource_val = pickle.load(opened_resource)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)
Add the following to the top of your file: # coding=utf-8
If you go to the link in the error you can see the reason why:
Defining the Encoding
Python will default to ASCII as standard encoding if no other
encoding hints are given.
To define a source code encoding, a magic comment must
be placed into the source files either as first or second
line in the file, such as:
# coding=<encoding name>
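Applied to the asker's file, a minimal sketch of the first lines of sentiment.py (the magic comment must be on line 1 or 2 to take effect):

# coding=utf-8
__author__ = 'karan'

import nltk
import re
import sys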
My views.py code:
#!/usr/bin/python
from django.template import loader, RequestContext
from django.http import HttpResponse
#from skey import find_root_tags, count, sorting_list
from search.models import Keywords
from django.shortcuts import render_to_response as rr

def front_page(request):
    if request.method == 'POST':
        from skey import find_root_tags, count, sorting_list
        str1 = request.POST['word']
        fo = open("/home/pooja/Desktop/xml.txt","r")
        for i in range(count.__len__()):
            file = fo.readline()
            file = file.rstrip('\n')
            find_root_tags(file,str1,i)
            list.append((file,count[i]))
        sorting_list(list)
        for name, count1 in list:
            s = Keywords(file_name=name,frequency_count=count1)
            s.save()
        fo.close()
        list1 = Keywords.objects.all()
        t = loader.get_template('search/results.html')
        c = RequestContext({'list1': list1})
        return HttpResponse(t.render(c))
    else:
        str1 = ''
        list = []
        template = loader.get_template('search/front_page.html')
        c = RequestContext(request)
        response = template.render(c)
        return HttpResponse(response)
skey.py has another function that is called from within find_root_tags():
def find_text(file,str1,i):
    str1 = str1.lower()
    exp = re.compile(r'<.*?>')
    with open(file) as f:
        lines = ''.join(line for line in f.readlines())
    text_only = exp.sub('',lines).strip()
    text_only = text_only.lower()
    k = text_only.count(str1)  # line 34
    count[i] = count[i]+k
When I ran my app on the server it gave me this error:
UnicodeDecodeError at /search/
'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
Request Method: POST
Request URL: http://127.0.0.1:8000/search/
Django Version: 1.4
Exception Type: UnicodeDecodeError
Exception Value:
'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
Exception Location: /home/pooja/Desktop/mysite/search/skey.py in find_text, line 34
Python Executable: /usr/bin/python
Python Version: 2.6.5
Python Path: ['/home/pooja/Desktop/mysite',
'/usr/lib/python2.6',
'/usr/lib/python2.6/plat-linux2',
'/usr/lib/python2.6/lib-tk',
'/usr/lib/python2.6/lib-old',
'/usr/lib/python2.6/lib-dynload',
'/usr/lib/python2.6/dist-packages',
'/usr/lib/python2.6/dist-packages/PIL',
'/usr/lib/python2.6/dist-packages/gst-0.10',
'/usr/lib/pymodules/python2.6',
'/usr/lib/python2.6/dist-packages/gtk-2.0',
'/usr/lib/pymodules/python2.6/gtk-2.0',
'/usr/local/lib/python2.6/dist-packages']
Can anyone tell me why it is showing this error, and how I can fix it? Please help.
You're mixing Unicode strings and bytestrings: str1 = request.POST['word'] is probably a Unicode string, while text_only is a bytestring, and Python fails to convert the latter to Unicode. You could use codecs.open() to specify the character encoding of the file. See Pragmatic Unicode and The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
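A minimal sketch of that suggestion for find_text(); the 'utf-8-sig' encoding is an assumption on my part (byte 0xef in position 0 is consistent with a UTF-8 byte order mark, which 'utf-8-sig' strips):

import codecs
import re

def find_text(file, str1, i):
    str1 = str1.lower()
    exp = re.compile(r'<.*?>')
    # codecs.open() decodes the file to unicode while reading, so both
    # operands of count() are unicode and no implicit ascii decode happens.
    with codecs.open(file, encoding='utf-8-sig') as f:
        lines = f.read()
    text_only = exp.sub('', lines).strip().lower()
    k = text_only.count(str1)
    count[i] = count[i] + k  # count is the module-level list from skey.py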
Probably your str1 is unicode but text_only is not (on line 34). The following is not a panacea, but if it fixes your problem then I am right:
k = u"{0}".format(text_only).count(str1)