To encode the URI, I used urllib.quote("schönefeld"), but when the string contains non-ASCII characters it throws:
KeyError: u'\xe9'
Code: return ''.join(map(quoter, s))
My input strings are köln, brønshøj, schönefeld, etc.
When I just print the strings on Windows (Python 2.7, PyScripter IDE) there is no problem, but on Linux it raises the exception (I guess the platform doesn't matter).
This is what I am trying:
from commands import getstatusoutput
from urllib import quote

queryParams = "schönefeld"
cmdString = "http://baseurl" + quote(queryParams)
print getstatusoutput(cmdString)
Exploring the reason for the issue:
In urllib.quote(), the exception is actually being thrown at return ''.join(map(quoter, s)).
The code in urllib is:
def quote(s, safe='/'):
    if not s:
        if s is None:
            raise TypeError('None object cannot be quoted')
        return s
    cachekey = (safe, always_safe)
    try:
        (quoter, safe) = _safe_quoters[cachekey]
    except KeyError:
        safe_map = _safe_map.copy()
        safe_map.update([(c, c) for c in safe])
        quoter = safe_map.__getitem__
        safe = always_safe + safe
        _safe_quoters[cachekey] = (quoter, safe)
    if not s.rstrip(safe):
        return s
    return ''.join(map(quoter, s))
The reason for the exception is in ''.join(map(quoter, s)): for every element in s, the quoter function is called, and finally the list is joined with '' and returned.
For a non-ASCII char like è, the equivalent entry is %E8, which is present in the _safe_map variable. But when I call quote('è'), it searches for the key \xe8; that key does not exist, so the exception is thrown.
So I just added s = [el.upper().replace("\\X","%") for el in s] before the call to ''.join(map(quoter, s)), inside a try-except block. Now it works fine.
But I am wondering whether what I have done is the correct approach, or whether it will create other issues?
Also, I have 200+ Linux instances, so it would be very tough to deploy this fix to all of them.
You are trying to quote Unicode data, so you need to decide how to turn that into URL-safe bytes.
Encode the string to bytes first. UTF-8 is often used:
>>> import urllib
>>> urllib.quote(u'sch\xe9nefeld')
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py:1268: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
return ''.join(map(quoter, s))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1268, in quote
return ''.join(map(quoter, s))
KeyError: u'\xe9'
>>> urllib.quote(u'sch\xe9nefeld'.encode('utf8'))
'sch%C3%A9nefeld'
However, the encoding depends on what the server will accept. It's best to stick to the encoding the original form was sent with.
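A minimal sketch of this pattern, assuming UTF-8 is what the target server expects (build_url and base_url are illustrative names, not from the question):

import urllib

def build_url(base_url, query):
    if isinstance(query, unicode):
        query = query.encode('utf-8')  # quote() wants bytes, not unicode
    return base_url + urllib.quote(query)

print build_url("http://baseurl?q=", u"sch\xf6nefeld")
# http://baseurl?q=sch%C3%B6nefeld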
By just converting the string to unicode, I resolved the issue.
Here is the snippet:
try:
    unicode(mystring, "ascii")
except UnicodeError:
    mystring = unicode(mystring, "utf-8")
else:
    pass
A detailed description of the solution can be found at http://effbot.org/pyfaq/what-does-unicodeerror-ascii-decoding-encoding-error-ordinal-not-in-range-128-mean.htm
I had the exact same error as #underscore, but in my case the problem was that map(quoter, s) tried to look for the key u'\xe9', which was not in _safe_map. However, \xe9 was, so I solved the issue by replacing u'\xe9' with \xe9 in s.
Moreover, shouldn't the return statement be within the try/except? I also had to change this to completely solve the problem.
I am trying to perform a RethinkDB match query with an escaped, user-provided unicode search param:
import re
from rethinkdb import RethinkDB
r = RethinkDB()

search_value = u"\u05e5"  # provided by user via flask
search_value_escaped = re.escape(search_value)  # results in u'\\\u05e5' ->
                                                # when encoded with "utf-8" gives "\ץ" as expected.

conn = r.connect(...)

results_cursor_a = r.db(...).table(...).order_by(index="id").filter(
    lambda doc: doc.coerce_to("string").match(search_value)
).run(conn)  # search_value works fine

results_cursor_b = r.db(...).table(...).order_by(index="id").filter(
    lambda doc: doc.coerce_to("string").match(search_value_escaped)
).run(conn)  # search_value_escaped spits an error
The error for search_value_escaped is the following:
ReqlQueryLogicError: Error in regexp `\ץ` (portion `\ץ`): invalid escape sequence: \ץ in:
r.db(...).table(...).order_by(index="id").filter(lambda var_1: var_1.coerce_to('string').match(u'\\\u05e5m'))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I tried encoding with "utf-8" before/after re.escape(), but got the same results with different errors. What am I missing? Is it something in my code or some kind of bug?
EDIT: .coerce_to('string') converts the document to a "utf-8" encoded string. RethinkDB also converts the query to "utf-8" and then matches them, hence the first query works even though it looks like a unicode match against a string.
From what it looks like, RethinkDB rejects escaped unicode characters, so I wrote a simple workaround with a custom escape, without implementing my own character-replacement logic (for fear that I would miss one and create a security issue).
import re

def no_unicode_escape(u):
    escaped_list = []
    for i in u:
        if ord(i) < 128:
            escaped_list.append(re.escape(i))
        else:
            escaped_list.append(i)
    rv = "".join(escaped_list)
    return rv
or a one-liner:
import re

def no_unicode_escape(u):
    return "".join(re.escape(i) if ord(i) < 128 else i for i in u)
This yields the required result of escaping "dangerous" ASCII characters while leaving unicode characters alone, and it works with RethinkDB as I wanted.
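A quick illustration of what the helper does, using the search term from the question plus an ASCII regex metacharacter (the trailing dot is made up for the example):

term = u"\u05e5."
print repr(no_unicode_escape(term))
# u'\u05e5\\.'  -- only the ASCII dot is escaped; the unicode char is left as-is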
Python 2.6, upgrading not an option
The script is designed to take fields from an ArcGIS database and create Oracle INSERT statements in a text file that can be used at a later date. There are 7500 records; after 3000 records it errors out and says the problem lies at:
fieldValue = unicode(str(row.getValue(field.name)),'utf-8',errors='ignore')
I have tried seemingly every variation of unicode and encode. I am new to Python and really just need someone with experience to look at my code and see where the problem is.
import arcpy

# Where the GDB Table is located
fc = "D:\GIS Data\eMaps\ALDOT\ALDOT_eMaps_SignInventory.gdb/SignLocation"

fields = arcpy.ListFields(fc)
cursor = arcpy.SearchCursor(fc)

# text file output
outFile = open(r"C:\Users\kkieliszek\Desktop\transfer.text", "w")

# insert values into table billboard.sign_inventory
for row in cursor:
    outFile.write("Insert into billboard.sign_inventory() Values (")
    for field in fields:
        fieldValue = unicode(str(row.getValue(field.name)), 'utf-8', errors='ignore')
        if row.isNull(field.name) or fieldValue.strip() == "":  # if field is Null or an empty string, print NULL
            value = "NULL"
            outFile.write('"' + value + '",')
        else:  # print the value in the field
            value = str(row.getValue(field.name))
            outFile.write('"' + value + '",')
    outFile.write("); \n\n ")

outFile.close()  # This closes the text file
Error Code:
Traceback (most recent call last):
File "tablemig.py", line 25, in <module>
fieldValue = unicode(str(row.getValue(field.name)),'utf-8',errors='ignore')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 76: ordinal not in range(128)
Never call str() on a unicode object:
>>> str(u'\u2019')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 0: ordinal not in range(128)
To write rows that contain Unicode strings in csv format, use UnicodeWriter instead of formatting the fields manually. It should fix several issues at once.
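UnicodeWriter is a recipe from the Python 2 csv module documentation rather than something you can import; a lightly trimmed copy of that recipe, plus a hedged usage sketch with made-up values, looks like this:

import csv, codecs, cStringIO

class UnicodeWriter:
    """A CSV writer which writes rows to file "f", encoded in the given encoding."""

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        # write the row UTF-8 encoded into the queue, then re-encode to the target encoding
        self.writer.writerow([s.encode("utf-8") for s in row])
        data = self.queue.getvalue().decode("utf-8")
        self.stream.write(self.encoder.encode(data))
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

# Hypothetical usage with unicode values like those coming from the cursor:
with open(r"C:\Users\kkieliszek\Desktop\transfer.csv", "wb") as f:
    writer = UnicodeWriter(f)
    writer.writerow([u"driver\u2019s licence", u"sch\xf6nefeld"])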
File objects that wrap the text encoding for you (io.TextIOWrapper, which io.open() returns) make manual encoding/decoding unnecessary.
Assuming the value from the row is a unicode object, simply use io.open() with the encoding argument set to the required encoding.
For example:
import io

with io.open(r"C:\Users\kkieliszek\Desktop\transfer.text", "w", encoding='utf-8') as my_file:
    my_file.write(my_unicode)
The problem is that you need to decode/encode the unicode/byte string properly instead of just calling str() on it. If you have a byte string object, you need to call decode() on it to convert it into a unicode object, ignoring anything that isn't ASCII. On the other hand, if you have a unicode object, you need to call encode() on it to convert it into a byte string, again ignoring non-ASCII content. So just use this function instead:
import re

def remove_unicode(string_data):
    """ (str|unicode) -> (str|unicode)

    recovers ascii content from string_data
    """
    if string_data is None:
        return string_data
    if isinstance(string_data, str):
        string_data = str(string_data.decode('ascii', 'ignore'))
    else:
        string_data = string_data.encode('ascii', 'ignore')
    remove_ctrl_chars_regex = re.compile(r'[^\x20-\x7e]')
    return remove_ctrl_chars_regex.sub('', string_data)

fieldValue = remove_unicode(row.getValue(field.name))
It should fix the problem.
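For instance, fed the character from the traceback (u'\u2019', a curly apostrophe), the function simply drops it; a quick illustration with a made-up value:

print repr(remove_unicode(u"driver\u2019s licence"))
# 'drivers licence'  -- the non-ASCII character is stripped, the rest survives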
I have a class Chunk with text fields title and text. When I want to print them, I get (surprise, surprise!) a UnicodeDecodeError. It gives me an error when I try to format the output string, but when I just concatenate text and title and return that, I get no error:
class Chunk:
    # init, fields, ...

    # this implementation will give me an error
    def __str__(self):
        return u'{0} {1}'.format(enc(self.text), enc(self.title))

    # but this is OK - all is printed without error
    def __str__(self):
        return enc(self.text) + enc(self.title)

def enc(x):
    return x.encode('utf-8', 'ignore')  # tried many combinations of arguments...

c = Chunk()
c.text, c.title = ...  # feed from external file
print c
Bum! Error!
return u'{0} {1}'.format ( enc(self.text), enc(self.title) )
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2844: ordinal not in range(128)
I think I used all the possible combinations of encode/decode/utf-8/ascii/replace/ignore/...
(the python unicode issue is really irritating!)
You should override __unicode__, not __str__, when you return a unicode.
There is no need to call .encode(), since the input is already a unicode. Just write
def __unicode__(self):
    return u"{0} {1}".format(self.text, self.title)
The simplest way to avoid Python 2.x's unicode problems is to set the overall default encoding to utf-8; otherwise such problems will keep arising in unexpected places:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
I have a function accepting requests from the network. Most of the time, the string passed in is not unicode, but sometimes it is.
I have code to convert everything to unicode, but it reports this error:
message.create(username, unicode(body, "utf-8"), self.get_room_name(),\
TypeError: decoding Unicode is not supported
I think the reason is that the 'body' parameter is sometimes already unicode, so unicode() raises an exception.
Is there any way to avoid this exception, e.g. by checking the type before the conversion?
You do not decode to UTF-8; you encode to UTF-8 or decode from it.
You can safely decode from UTF8 even if it's just ASCII. ASCII is a subset of UTF8.
The easiest way to detect whether it needs decoding or not is:
if not isinstance(data, unicode):
    # It's not Unicode!
    data = data.decode('UTF8')
You can use either this:
try:
    body = unicode(body)
except UnicodeDecodeError:
    body = body.decode('utf8')
Or this:
try:
    body = unicode(body, 'utf8')
except TypeError:
    body = unicode(body)
Mark Pilgrim wrote a Python library to guess text encodings:
http://chardet.feedparser.org/
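A hedged sketch of how chardet is typically used (the file name is made up; chardet.detect() returns a dict with 'encoding' and 'confidence' keys):

import chardet

raw = open("incoming_message.txt", "rb").read()
guess = chardet.detect(raw)            # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
text = raw.decode(guess["encoding"] or "utf-8", "replace")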
On Unicode and UTF-8, the first two sections of chapter 4 of his book ‘Dive into Python 3’ are pretty great:
http://diveintopython3.org/strings.html
This is what I use:
def to_unicode_or_bust(obj, encoding='utf-8'):
    if isinstance(obj, basestring):
        if not isinstance(obj, unicode):
            obj = unicode(obj, encoding)
    return obj
It's taken from this presentation: http://farmdev.com/talks/unicode/
And this is some sample code that uses it:
def hash_it_safe(s):
    try:
        s = to_unicode_or_bust(s)
        return hash_it_basic(s)
    except UnicodeDecodeError:
        return hash_it_basic(s)
    except UnicodeEncodeError:
        assert type(s) is unicode
        return hash_it_basic(s.encode('utf-8'))
Anyone have some thoughts on how to improve this code? ;)
I've got this function, which I modified from material in chapter 1 of the online NLTK book. It's been very useful to me but, despite reading the chapter on Unicode, I feel just as lost as before.
def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    tokens = nltk.wordpunct_tokenize(rawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab
When I tried it the other day on Also Sprach Zarathustra, it clobbered words with an umlaut over the o's and u's. I'm sure some of you will know why that happened. I'm also sure that it's quite easy to fix. I know it just has to do with calling a function that re-encodes the tokens into unicode strings. If so, it seems to me it might not happen inside that function definition at all, but here, where I prepare to write to file:
def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)
    jottedf = '\n'.join(jotted)
    filemydata.write(jottedf)
    filemydata.close()
    return 0
I heard that what I had to do was decode the string into unicode after reading it from the file. I tried amending the function like so:
def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    unirawness = rawness.decode('utf-8')
    tokens = nltk.wordpunct_tokenize(unirawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab
But that brought this error, when I used it on Hungarian. When I used it on German, I had no errors.
>>> import bookroutines
>>> elles1 = bookroutines.openbookreturnvocab("lk1-les1")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "bookroutines.py", line 9, in openbookreturnvocab
nltktext = nltk.Text(tokens)
File "/usr/lib/pymodules/python2.6/nltk/text.py", line 285, in __init__
self.name = " ".join(map(str, tokens[:8])) + "..."
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 4: ordinal not in range(128)
I fixed the function that files the data like so:
def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)
    jottedf = u'\n'.join(jotted)
    filemydata.write(jottedf)
    filemydata.close()
    return 0
However, that brought this error when I tried to file the German:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "bookroutines.py", line 23, in jotindex
filemydata.write(jottedf)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 414: ordinal not in range(128)
>>>
...which is what you get when you try to write the u'\n'.join'ed data.
>>> jottedf = u'/n'.join(elles1)
>>> filemydata.write(jottedf)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 504: ordinal not in range(128)
For each string that you read from your file, you can convert it to unicode by calling rawness.decode('utf-8'), if the text is in UTF-8. You will end up with unicode objects. Also, I don't know what "jotted" is, but you may want to make sure it's a sequence of unicode objects and use u'\n'.join(jotted) instead.
Update:
It appears that the NLTK library doesn't like unicode objects. Fine, then you have to make sure that you are using str instances with UTF-8 encoded text. Try using this:
tokens = nltk.wordpunct_tokenize(unirawness)
nltktext = nltk.Text([token.encode('utf-8') for token in tokens])
and this:
jottedf = u'\n'.join(jotted)
filemydata.write(jottedf.encode('utf-8'))
but if jotted is really a list of UTF-8-encoded str, then you don't need this and this should be enough:
jottedf = '\n'.join(jotted)
filemydata.write(jottedf)
By the way, it looks as though NLTK isn't very cautious with respect to unicode and encoding (at least in the demos). Better be careful and check that it has processed your tokens correctly. Also, and this may be why you get errors with Hungarian text and not German text, check your encodings.
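For example, if the Hungarian file was not actually saved as UTF-8 (ISO-8859-2 is a common alternative for Hungarian), the decode step will fail or mangle characters. This is only an assumption about your files, but a sketch of reading with a fallback encoding could look like:

def read_text(path):
    raw = open(path).read()
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # assumed fallback; replace with whatever encoding the file really uses
        return raw.decode('iso-8859-2')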