UnicodeEncodeError with my code - python

I have a problem with a UnicodeEncodeError in my users_information dict:
{u'\u0633\u062a\u064a\u062f#nimbuzz.com': {'UserName': u'\u0633\u062a\u064a\u062f#nimbuzz.com', 'Code': 5, 'Notes': '', 'Active': 0, 'Date': '12/07/2014 14:16', 'Password': '560pL390T', 'Email': u'yuyb0y#gmail.com'}}
And I need to run this code to get users information:
def get_users_info(type, source, parameters):
    users_registertion_file = 'static/users_information.txt'
    fp = open(users_registertion_file, 'r')
    users_information = eval(fp.read())
    if parameters:
        jid = parameters+"#nimbuzz.com"
        if users_information.has_key(jid):
            reply(type, source, u"User name:\n" +str(users_information[jid]['UserName'])+ u"\nPassword:\n" +str(users_information[jid]['Password'])+ u"\nREG-code:\nP" +str(users_information[jid]['Code'])+ u"\nDate:\n" +str(users_information[jid]['Date'])+ u"\naccount status:\n " +str(users_information[jid]['Active']))
        else:
            reply(type, source, u"This user " +parameters+ u" not in user list")
    else:
        reply(type, source, u"write the id after command")
but when I try to get the user information I get this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
I tried encoding the jid using encode('utf8'):
jid = parameters.encode('utf8')+"#nimbuzz.com"
but I get the same error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
How can I solve this problem? As you can see, the UserName key in the users_information dict looks like:
u'\u0633\u062a\u064a\u062f#nimbuzz.com'
and the users_information dict is stored in a txt file.

You won't find your user information unless jid is a unicode string. Make sure parameters is a unicode value, and it'll be easier to use string formatting:
jid = u"{}#nimbuzz.com".format(parameters)
If you use an encoded bytestring, Python will not find your username in the dictionary as it won't know what encoding you used for the string and won't implicitly decode or encode to make the comparisons.
Next, you cannot call str() on a Unicode value without specifying a codec:
str(users_information[jid]['UserName'])
This is guaranteed to throw a UnicodeEncodeError exception if users_information[jid]['UserName'] contains anything other than ASCII codepoints.
You need to use Unicode values throughout and leave encoding to the last possible moment (preferably by leaving it to a library).
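For example, a minimal Python 2 interpreter sketch (values borrowed from the question) showing both pitfalls: a UTF-8 encoded bytestring key does not match the unicode key stored in the dictionary, and calling str() on a non-ASCII unicode value raises UnicodeEncodeError:
>>> users = {u'\u0633\u062a\u064a\u062f#nimbuzz.com': {'Code': 5}}
>>> bytekey = u'\u0633\u062a\u064a\u062f'.encode('utf8') + "#nimbuzz.com"
>>> bytekey in users                   # encoded bytestring key is not found
False
>>> str(u'\u0633\u062a\u064a\u062f')   # implicit ASCII encode fails
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)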
You can use string formatting with unicode objects here too:
reply(type, source,
      u"User name:\n{0[UserName]}\nPassword:\n{0[Password]}\n"
      u"REG-code:\nP{0[Code]}\nDate:\n{0[Date]}\n"
      u"account status:\n {0[Active]}".format(users_information[jid]))
This interpolates the various keys from users_information[jid] without calling str on each value.
Note that dict.has_key() has been deprecated; use the in operator to test for a key instead:
if jid in users_information:
Last but not least, don't use eval() if you can avoid it. You should use JSON for the file format here; if you cannot influence that, then at least use ast.literal_eval() on the file contents instead of eval(), which limits permissible input to just Python literal syntax:
import ast
# ...
users_information = ast.literal_eval(fp.read())
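If switching the file format is possible, a minimal sketch of the JSON approach (the .json file name is hypothetical; json.load() returns unicode strings, so lookups with a unicode jid work as expected):
import json

users_registertion_file = 'static/users_information.json'   # hypothetical file name

# Writing: the default ensure_ascii=True escapes non-ASCII text, so a plain file object is fine.
with open(users_registertion_file, 'w') as fp:
    json.dump(users_information, fp)

# Reading: keys and string values come back as unicode.
with open(users_registertion_file) as fp:
    users_information = json.load(fp)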

I had a similar problem years ago:
jid = parameters+"#nimbuzz.com"
must be
jid = parameters+u"#nimbuzz.com"
and put this at the first or second line of the file:
#coding:utf8
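For example, a minimal sketch (Python 2, with assumed values) relying on both fixes:
# coding: utf8
name = u'ستيد'                  # non-ASCII literal is allowed thanks to the coding declaration
jid = name + u"#nimbuzz.com"    # both operands are unicode, so no implicit ASCII decode
print jid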
Example for Martijn Pieters - on my machine
Python 2.7.8 (default, Jul 1 2014, 17:30:21)
[GCC 4.9.0 20140604 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a=u'asdf'
>>> b='ваывап'
>>> a+b
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
>>> c=u'аыиьт'
>>> a+c
u'asdf\u0430\u044b\u0438\u044c\u0442'
>>>

Related

python 3.6 decode dict bytes to json not working

I have data in the bytes format below:
{'command': 'MESSAGE', 'body': b'\x04\x08{\x0b:\tbody"n\x04\x08{\x08:\tdata{\n:\x0bstdout"\x14output-data\n:\rexitcodei\x00:\x0bstderr"\x00:\x0boutput0:\nerror0:\x0estatusmsg"\x07OK:\x0fstatuscodei\x00:\rsenderid"\x13server1:\thash"%903ff3bf7e9212105df23c92dd8f718a:\x10senderagent"\ntoktok:\x0cmsgtimel+\x07\xf6\xb9hZ:\x0erequestid"%7a358c34f8f9544sd2350c99953d0eec', 'rawHeaders': [('content-length', '264'), ('expires', '1516812860547'), ('destination', '/queue/test.queue'), ('priority', '4'), ('message-id', '12345678'), ('content-type', 'text/plain; charset=UTF-8'), ('timestamp', '1516812790347')]}
and I'm trying to decode it and convert it to JSON-formatted data, but it's not working. I tried data.decode() and data.decode('utf-8'), and json.loads as well, but nothing works.
When I tried data.decode('utf-8') I got the error below:
'utf-8' codec can't decode byte 0xf6 in position 215: invalid start byte
and when I tried data.decode('ascii') I got the error below:
'ascii' codec can't decode byte 0xa9 in position 215: ordinal not in range(128)
I'm confused about whether I'm doing this the right way or missing something in the data conversion and parsing.
Update 1:
I just found that this data is generated by Ruby with the PSK security plugin, and the message object has a .decode! public_method. Is there any way to use the same public_method in Python to decode it? If possible, using PSK would also be fine.
JSON is a Unicode format; it cannot (easily and transparently) accommodate arbitrary byte strings. What you can easily do is save a blob in some textual format -- base64 is a common choice. But of course, all producers and consumers need to share an understanding of how to decode the blob, rather than just use it as text.
Python 3.5.1 (default, Dec 26 2015, 18:08:53)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> d = {'json': True, 'data': b'\xff\xff\xff'}
>>> json.dumps(d)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
... yada yada yada ...
TypeError: b'\xff\xff\xff' is not JSON serializable
>>> import base64
>>> base64.b64encode(d['data'])
b'////'
>>> base64.b64encode(d['data']).decode('ascii')
'////'
>>> d['data_base64'] = base64.b64encode(d['data']).decode('ascii')
>>> del d['data']
>>> json.dumps(d)
'{"json": true, "data_base64": "////"}'
I very specifically used a different name for the encoded field to avoid having any consumer think that the base64 blob is the actual value for the data member.
Random binary data most definitely isn't valid UTF-8 so obviously cannot be decoded using that codec. UTF-8 is a very specific encoding for Unicode text which cannot really be used for data which isn't exactly that. You usually encode, rather than decode, binary data for transport, and need to have something at the other end decode it back into bytes. Here, that encoding is base64, but anything which can transparently embed binary as text will do.
If that is your data and you are trying to round-trip it through the JSON serializable format, this will do it:
import json
import base64
data = {'command': 'MESSAGE',
'body': b'\x04\x08{\x0b:\tbody"n\x04\x08{\x08:\tdata{\n:\x0bstdout"\x14output-data\n:\rexitcodei\x00:\x0bstderr"\x00:\x0boutput0:\nerror0:\x0estatusmsg"\x07OK:\x0fstatuscodei\x00:\rsenderid"\x13server1:\thash"%903ff3bf7e9212105df23c92dd8f718a:\x10senderagent"\ntoktok:\x0cmsgtimel+\x07\xf6\xb9hZ:\x0erequestid"%7a358c34f8f9544sd2350c99953d0eec',
'rawHeaders': [('content-length', '264'), ('expires', '1516812860547'), ('destination', '/queue/test.queue'), ('priority', '4'), ('message-id', '12345678'), ('content-type', 'text/plain; charset=UTF-8'), ('timestamp', '1516812790347')]}
# Make a copy of the original data and base64 for bytes content.
datat = data.copy()
datat['body'] = base64.encodebytes(datat['body']).decode('ascii')
# Now it serializes
jsondata = json.dumps(datat)
print(jsondata)
# Read it back and decode the base64 field back to its original bytes value
data2 = json.loads(jsondata)
data2['body'] = base64.decodebytes(data2['body'].encode('ascii'))
# For comparison, since the tuples in 'rawHeaders' are read back as lists by JSON,
# convert the list entries back to tuples.
data2['rawHeaders'] = [tuple(x) for x in data2['rawHeaders']]
# Did the data restore correctly?
print(data == data2)
Output:
{"command": "MESSAGE", "body": "BAh7CzoJYm9keSJuBAh7CDoJZGF0YXsKOgtzdGRvdXQiFG91dHB1dC1kYXRhCjoNZXhpdGNvZGVp\nADoLc3RkZXJyIgA6C291dHB1dDA6CmVycm9yMDoOc3RhdHVzbXNnIgdPSzoPc3RhdHVzY29kZWkA\nOg1zZW5kZXJpZCITc2VydmVyMToJaGFzaCIlOTAzZmYzYmY3ZTkyMTIxMDVkZjIzYzkyZGQ4Zjcx\nOGE6EHNlbmRlcmFnZW50Igp0b2t0b2s6DG1zZ3RpbWVsKwf2uWhaOg5yZXF1ZXN0aWQiJTdhMzU4\nYzM0ZjhmOTU0NHNkMjM1MGM5OTk1M2QwZWVj\n", "rawHeaders": [["content-length", "264"], ["expires", "1516812860547"], ["destination", "/queue/test.queue"], ["priority", "4"], ["message-id", "12345678"], ["content-type", "text/plain; charset=UTF-8"], ["timestamp", "1516812790347"]]}
True

Python UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3

I'm reading a config file in Python, getting its sections, and creating a new config file for each section.
However, I'm getting a decode error because one of the strings contains Español=spain:
self.output_file.write( what.replace( " = ", "=", 1 ) )
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
How would I adjust my code to allow for encoded characters such as these? I'm very new to this, so please excuse me if this is something simple.
import ConfigParser

class EqualsSpaceRemover:
    output_file = None

    def __init__( self, new_output_file ):
        self.output_file = new_output_file

    def write( self, what ):
        self.output_file.write( what.replace( " = ", "=", 1 ) )

def get_sections():
    configFilePath = 'C:\\test.ini'
    config = ConfigParser.ConfigParser()
    config.optionxform = str
    config.read(configFilePath)
    for section in config.sections():
        configdata = {k:v for k,v in config.items(section)}
        confignew = ConfigParser.ConfigParser()
        cfgfile = open("C:\\" + section + ".ini", 'w')
        confignew.add_section(section)
        for x in configdata.items():
            confignew.set(section, x[0], x[1])
        confignew.write( EqualsSpaceRemover( cfgfile ) )
        cfgfile.close()
If you use Python 2 with from __future__ import unicode_literals, then every string literal you write is a unicode literal, as if you had prefixed every literal with u"...", unless you explicitly write b"...".
This explains why you get a UnicodeDecodeError on this line:
what.replace(" = ", "=", 1)
because what you actually do is
what.replace(u" = ",u"=",1 )
ConfigParser uses plain old str for its items when it reads a file using the parser.read() method, which means what will be a str. If you pass unicode arguments to str.replace(), the string is converted (decoded) to unicode, the replacement is applied, and the result is returned as unicode. But if what contains characters that can't be decoded to unicode using the default encoding, you get a UnicodeDecodeError where you wouldn't expect one.
So to make this work you can
use explicit prefixes for byte strings: what.replace(b" = ", b"=", 1)
or remove the unicode_literals future import.
Generally you shouldn't mix unicode and str (Python 3 fixes this by making it an error in almost every case). Be aware that from __future__ import unicode_literals changes every non-prefixed literal to unicode; it doesn't automatically make your code work with unicode in all cases. Quite the opposite, in many cases.
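A minimal sketch (Python 2, hypothetical value) of the implicit decode described above:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

what = b"Espa\xc3\xb1ol = spain"        # UTF-8 encoded byte string, as ConfigParser would return it
print(what.replace(b" = ", b"=", 1))    # byte-string arguments: works
print(what.replace(" = ", "=", 1))      # unicode arguments (because of the future import):
                                        # `what` is implicitly decoded as ASCII and this raises
                                        # UnicodeDecodeError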

ascii codec can't encode character u'\u2019' ordinal not in range(128)

Python 2.6, upgrading not an option
The script is designed to take fields from an ArcGIS database and create Oracle INSERT statements in a text file that can be used at a later date. There are 7500 records; after 3000 records it errors out and says the problem lies at:
fieldValue = unicode(str(row.getValue(field.name)),'utf-8',errors='ignore')
I have tried seemingly every variation of unicode and encode. I am new to Python and really just need someone with experience to look at my code and see where the problem is.
import arcpy

# Where the GDB Table is located
fc = "D:\GIS Data\eMaps\ALDOT\ALDOT_eMaps_SignInventory.gdb/SignLocation"

fields = arcpy.ListFields(fc)
cursor = arcpy.SearchCursor(fc)

# text file output
outFile = open(r"C:\Users\kkieliszek\Desktop\transfer.text", "w")

# insert values into table billboard.sign_inventory
for row in cursor:
    outFile.write("Insert into billboard.sign_inventory() Values (")
    for field in fields:
        fieldValue = unicode(str(row.getValue(field.name)),'utf-8',errors='ignore')
        if row.isNull(field.name) or fieldValue.strip() == "":  # if field is Null or a Empty String print NULL
            value = "NULL"
            outFile.write('"' + value + '",')
        else:  # print the value in the field
            value = str(row.getValue(field.name))
            outFile.write('"' + value + '",')
    outFile.write("); \n\n ")

outFile.close()  # This closes the text file
Error Code:
Traceback (most recent call last):
File "tablemig.py", line 25, in <module>
fieldValue = unicode(str(row.getValue(field.name)),'utf-8',errors='ignore')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 76: ordinal not in range(128)
Never call str() on a unicode object:
>>> str(u'\u2019')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 0: ordinal not in range(128)
To write rows that contain Unicode strings in csv format, use UnicodeWriter instead of formatting the fields manually. It should fix several issues at once.
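(UnicodeWriter is the helper class shown in the examples section of the Python 2 csv module documentation.) A simpler sketch of the same idea, assuming each row is a list of unicode values, is to encode each field to UTF-8 only as it is written:
import csv

# Hypothetical names: `rows` and the output path stand in for the script's own values.
with open(r"C:\Users\kkieliszek\Desktop\transfer.csv", "wb") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow([field.encode('utf-8') for field in row])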
File TextWrappers make manual encoding/decoding unnecessary.
Assuming the result from the row is a Unicode string, simply use io.open() with the encoding argument set to the required encoding.
For example:
import io

with io.open(r"C:\Users\kkieliszek\Desktop\transfer.text", "w", encoding='utf-8') as my_file:
    my_file.write(my_unicode)
The problem is that you need to decode/encode the unicode/byte string instead of just calling str() on it. So, if you have a byte string object, you need to call decode on it to convert it into a unicode object, ignoring anything that isn't ASCII. On the other hand, if you have a unicode object, you need to call encode on it to convert it into a byte string, again ignoring anything that isn't ASCII. So, just use this function instead:
import re

def remove_unicode(string_data):
    """ (str|unicode) -> (str|unicode)
    recovers ascii content from string_data
    """
    if string_data is None:
        return string_data
    if isinstance(string_data, str):
        string_data = str(string_data.decode('ascii', 'ignore'))
    else:
        string_data = string_data.encode('ascii', 'ignore')
    remove_ctrl_chars_regex = re.compile(r'[^\x20-\x7e]')
    return remove_ctrl_chars_regex.sub('', string_data)

fieldValue = remove_unicode(row.getValue(field.name))
It should fix the problem.

python unicode: How can I judge if a string needs to be decoded into utf-8?

I have a function accepting requests from the network. Most of the time, the string passed in is not unicode, but sometimes it is.
I have code to convert everything to unicode, but it reports this error:
message.create(username, unicode(body, "utf-8"), self.get_room_name(),\
TypeError: decoding Unicode is not supported
I think the reason is the 'body' parameter is already unicode, so unicode() raises an exception.
Is there any way to avoid this exception, e.g. judge the type before the conversion?
You do not decode to UTF-8; you encode to UTF-8 or decode from it.
You can safely decode from UTF8 even if it's just ASCII. ASCII is a subset of UTF8.
The easiest way to detect if it needs decoding or not is
if not isinstance(data, unicode):
    # It's not Unicode!
    data = data.decode('UTF8')
You can use either this:
try:
    body = unicode(body)
except UnicodeDecodeError:
    body = body.decode('utf8')
Or this:
try:
    body = unicode(body, 'utf8')
except TypeError:
    body = unicode(body)
Mark Pilgrim wrote a Python library to guess text encodings:
http://chardet.feedparser.org/
On Unicode and UTF-8, the first two sections of chapter 4 of his book ‘Dive into Python 3’ are pretty great:
http://diveintopython3.org/strings.html
This is what I use:
def to_unicode_or_bust(obj, encoding='utf-8'):
    if isinstance(obj, basestring):
        if not isinstance(obj, unicode):
            obj = unicode(obj, encoding)
    return obj
It's taken from this presentation: http://farmdev.com/talks/unicode/
And this is a sample code that uses it:
def hash_it_safe(s):
    try:
        s = to_unicode_or_bust(s)
        return hash_it_basic(s)
    except UnicodeDecodeError:
        return hash_it_basic(s)
    except UnicodeEncodeError:
        assert type(s) is unicode
        return hash_it_basic(s.encode('utf-8'))
Anyone have some thoughts on how to improve this code? ;)

How can I make this Python2.6 function work with Unicode?

I've got this function, which I modified from material in chapter 1 of the online NLTK book. It's been very useful to me but, despite reading the chapter on Unicode, I feel just as lost as before.
def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    tokens = nltk.wordpunct_tokenize(rawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab
When I tried it the other day on Also Sprach Zarathustra, it clobbered words with an umlaut over the o's and u's. I'm sure some of you will know why that happened. I'm also sure that it's quite easy to fix. I know that it just has to do with calling a function that re-encodes the tokens into unicode strings. If so, it seems to me it might not happen inside that function definition at all, but here, where I prepare to write to file:
def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)
    jottedf = '\n'.join(jotted)
    filemydata.write(jottedf)
    filemydata.close()
    return 0
I heard that what I had to do was decode the string into unicode after reading it from the file. I tried amending the function like so:
def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    unirawness = rawness.decode('utf-8')
    tokens = nltk.wordpunct_tokenize(unirawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab
But that brought this error, when I used it on Hungarian. When I used it on German, I had no errors.
>>> import bookroutines
>>> elles1 = bookroutines.openbookreturnvocab("lk1-les1")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "bookroutines.py", line 9, in openbookreturnvocab
nltktext = nltk.Text(tokens)
File "/usr/lib/pymodules/python2.6/nltk/text.py", line 285, in __init__
self.name = " ".join(map(str, tokens[:8])) + "..."
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 4: ordinal not in range(128)
I fixed the function that files the data like so:
def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)
    jottedf = u'\n'.join(jotted)
    filemydata.write(jottedf)
    filemydata.close()
    return 0
However, that brought this error, when I tried to file the German:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "bookroutines.py", line 23, in jotindex
filemydata.write(jottedf)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 414: ordinal not in range(128)
>>>
...which is what you get when you try to write the u'\n'.join'ed data.
>>> jottedf = u'/n'.join(elles1)
>>> filemydata.write(jottedf)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 504: ordinal not in range(128)
For each string that you read from your file, you can convert it to unicode by calling rawness.decode('utf-8'), if you have the text in UTF-8. You will end up with unicode objects. Also, I don't know what "jotted" is, but you may want to make sure it's a unicode object and use u'\n'.join(jotted) instead.
Update:
It appears that the NLTK library doesn't like unicode objects. Fine, then you have to make sure that you are using str instances with UTF-8 encoded text. Try using this:
tokens = nltk.wordpunct_tokenize(unirawness)
nltktext = nltk.Text([token.encode('utf-8') for token in tokens])
and this:
jottedf = u'\n'.join(jotted)
filemydata.write(jottedf.encode('utf-8'))
but if jotted is really a list of UTF-8-encoded str, then you don't need this and this should be enough:
jottedf = '\n'.join(jotted)
filemydata.write(jottedf)
By the way, it looks as though NLTK isn't very cautious with respect to unicode and encoding (at least in the demos). Better be careful and check that it has processed your tokens correctly. Also, check your encodings; this may be why you get errors with the Hungarian text and not the German.
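Putting the pieces together, a minimal sketch of both functions with the suggested fixes applied (assuming the input files really are UTF-8 and that NLTK is fed UTF-8 encoded str tokens):
import nltk

def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    unirawness = rawness.decode('utf-8')   # decode the bytes to unicode on the way in
    tokens = nltk.wordpunct_tokenize(unirawness)
    # hand nltk.Text() UTF-8 encoded str tokens, since it chokes on non-ASCII unicode
    nltktext = nltk.Text([token.encode('utf-8') for token in tokens])
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab

def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)
    jottedf = '\n'.join(jotted)            # the tokens are already UTF-8 encoded str
    filemydata.write(jottedf)
    filemydata.close()
    return 0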
