How to split string containing non-ascii characters in views? - python

In Python 2.7 and Django 1.8, I have this function in the postman views:
def mod1(message):
    print 'message is', message  # bob>mary:سلام
    message = str(message)  # without this I get 'Message' object has no attribute 'split'
    sndr = message.split('>')[0]
    print 'snder', sndr
    #...
Which gives this error:
'ascii' codec can't decode byte 0xd8 in position 15: ordinal not in range(128)
Strangely, I can do the split in Python terminal.
I have also added # -*- coding: utf-8 -*- at top of the views.
Appreciate your hints to solve this.

The message contains some Unicode text.
If you do not encode it explicitly before printing or otherwise converting it, you will get errors like this.
They occur because Python 2 by default converts between byte strings and unicode strings using only the ASCII codec, which cannot handle the Arabic characters. Usually you want to tell it to use UTF-8 or a similarly capable codec instead.
unicode(something).encode('UTF-8')
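A minimal sketch of the same idea, assuming the message text has the form sender>recipient:body as in the question's printed output. The helper name split_header is hypothetical. It decodes byte strings to text before splitting, so no implicit ASCII conversion takes place:

```python
# -*- coding: utf-8 -*-

def split_header(raw):
    """Split 'sender>recipient:body' into three parts (hypothetical helper)."""
    # Decode byte strings explicitly; text strings pass through unchanged.
    text = raw.decode('utf-8') if isinstance(raw, bytes) else raw
    sender, rest = text.split(u'>', 1)
    recipient, body = rest.split(u':', 1)
    return sender, recipient, body


# UTF-8 bytes stand in for what str(message) might return in Python 2.
parts = split_header(u'bob>mary:سلام'.encode('utf-8'))
```

In the view itself, unicode(message) could supply the text on Python 2, assuming the model defines __unicode__.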

Related

How to handle encoding in Python 2.7 and SQLAlchemy 🏴‍☠️

I wrote some code in Python 3.5 that uses Tweepy and SQLAlchemy, with the following lines to load tweets into a database, and it worked well:
twitter = Twitter(str(tweet.user.name).encode('utf8'), str(tweet.text).encode('utf8'))
session.add(twitter)
session.commit()
Using the same code now in Python 2.7 raises an Error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in
position 139: ordinal not in range(128)
What's the solution? My MySQL configuration is the following:
Server side --> utf8mb4 encoding
Client side --> create_engine('mysql+pymysql://abc:def#abc/def', encoding='utf8', convert_unicode=True)
UPDATE
It seems that there is no solution, at least not with Python 2.7 + SQLAlchemy. Here is what I found out so far and if I am wrong, please correct me.
Tweepy, at least in Python 2.7, returns unicode type objects.
In Python 2.7: tweet = u'☠' is a <'unicode' type>
In Python 3.5: tweet = u'☠' is a <'str' class>
This means Python 2.7 will give me a 'UnicodeEncodeError' if I do str(tweet), because Python 2.7 then tries to encode the character '☠' into ASCII, which is not possible because ASCII can only handle basic characters.
Conclusion:
Using just this statement tweet.user.name in the SQLAlchemy line gives me the following error:
UnicodeEncodeError: 'latin-1' codec can't encode characters in
position 0-4: ordinal not in range(256)
Using either this statement tweet.user.name.encode('utf-8') or this one str(tweet.user.name.encode('utf-8')) in the SQLAlchemy line should actually work the right way, but it shows me unencoded characters on the database side:
ð´ââ ï¸Jack Sparrow
This is what I want it to show:
Printed: 🏴‍☠️ Jack Sparrow
Special characters unicode: u'\U0001f3f4\u200d\u2620\ufe0f'
Special characters UTF-8 encoding: '\xf0\x9f\x8f\xb4\xe2\x80\x8d\xe2\x98\xa0\xef\xb8\x8f'
Do not use any encode/decode functions; they only compound the problems.
Do set the connection to be UTF-8.
Do set the column/table to utf8mb4 instead of utf8.
Do use # -*- coding: utf-8 -*- at the beginning of Python code.
See also "More Python tips"; note that it has a link to "Python 2.7 issues; improvements in Python 3".
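As a rough illustration of the symptoms in the question (not a fix for the connection settings), the sketch below reproduces the garbled output by reading UTF-8 bytes as latin-1, and shows that the flag emoji needs four bytes in UTF-8. MySQL's utf8 character set stores at most three bytes per character, which is why utf8mb4 is recommended. The variable names are made up for the example:

```python
# -*- coding: utf-8 -*-

name = u'\U0001f3f4\u200d\u2620\ufe0f Jack Sparrow'

utf8_bytes = name.encode('utf-8')

# Reading UTF-8 bytes as latin-1 yields one character per byte: mojibake.
mojibake = utf8_bytes.decode('latin-1')

# The damage is reversible as long as no bytes were dropped along the way.
restored = mojibake.encode('latin-1').decode('utf-8')

# The flag emoji alone takes four bytes in UTF-8.
flag_len = len(u'\U0001f3f4'.encode('utf-8'))
```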

robot framework: difference in encoding between old formatting %s and new formatting

I have a library with a keyword. The keyword writes a message to the test documentation.
The Python file is in UTF-8 and has the required header
# -*- coding: utf-8 -*-
*.robot files are in utf-8
Execution of this keyword in robot file with non-ascii symbols gives:
If the keyword uses "%s" % msg: no error, and the log file shows the Russian message correctly.
If the keyword uses "{}".format(msg) or "{!s}".format(msg): I get the error UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-10: ordinal not in range(128)
As you can see, I only changed the old Python formatting to the new style. How can I fix this error with non-ASCII text in the new style, without going back to old-style formatting?
Try to use the str.decode(encoding='UTF-8', errors='strict') method. See the Python docs.
Example:
"{}".format(msg.decode(errors='replace'))
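One commonly suggested alternative, which is not taken from the answer above, is to make the template itself a unicode string. In Python 2, a byte-string template such as "{}" calls str() on a unicode argument, which is where the ASCII encode step comes from, whereas "%s" promotes the result to unicode. A minimal sketch, assuming msg is already a unicode object:

```python
# -*- coding: utf-8 -*-

msg = u'Привет, мир'

# Both templates are text (unicode) strings, so no implicit ASCII step occurs.
old_style = u'%s' % msg
new_style = u'{}'.format(msg)
```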

UnicodeDecodeError error writing .xlsx file using xlsxwriter

I am trying to write about 1000 rows to a .xlsx file from my python application. The data is basically a combination of integers and strings. I am getting intermittent error while running wbook.close() command. The error is the following:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15:
ordinal not in range(128)
My data does not have anything in unicode. I am wondering why the decoder is being used at all. Has anyone noticed this problem?
0xc3 is the first byte of a two-byte UTF-8 sequence, used for characters such as "À". So what you need to do is change the encoding. Use the decode() method.
string.decode('utf-8')
Also depending on your needs and uses you could add
# -*- coding: utf-8 -*-
at the beginning of your script, but only if you are sure that the encoding will not interfere and break something else.
As Alex Hristov points out, you have some non-ASCII data in your code that needs to be encoded as UTF-8 for Excel.
See the following examples from the docs which each have instructions on handling UTF-8 with XlsxWriter in different scenarios:
Example: Simple Unicode with Python 2
Example: Simple Unicode with Python 3
Example: Unicode - Polish in UTF-8
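A minimal sketch of the decode step suggested above, assuming the rows mix integers with UTF-8 byte strings. The helper to_text and the sample rows are hypothetical. The cleaned values are what would then be passed to the worksheet's write call:

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding='utf-8'):
    """Decode byte strings to text; return every other value unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


rows = [
    [1, b'plain ascii'],
    [2, u'Ærøskøbing'.encode('utf-8')],  # bytes as they might arrive from a file
]

# Normalise every cell before handing the rows to the spreadsheet writer.
clean_rows = [[to_text(cell) for cell in row] for row in rows]
```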

Figuring out unicode: 'ascii' codec can't decode

I currently use Sublime 2 and run my python code there.
When I try to run this code, I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
6: ordinal not in range(128)
# -*- coding: utf-8 -*-
s = unicode('abcdefö')
print s
I have been reading the Python documentation on unicode and as far as I understand this should work. Or is it the console that's not working?
Edit: Using s = u'abcdefö' as a string produces almost the same result. The result I get is
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in
position 6: ordinal not in range(128)
What happens is that unicode('abcdefö') tries to decode the encoded string to unicode at runtime. The coding: utf-8 line only tells Python that the source file is encoded in UTF-8. By the time the script runs, it has been compiled and the string has been stored as an encoded byte string. So when Python tries to decode the string, it uses ASCII by default. As the string is actually UTF-8 encoded, this fails.
You can do s = u'abcdefö', which tells the compiler to decode the string with the encoding declared for the file and store it as unicode. s = unicode('abcdefö', 'utf8') or s = 'abcdefö'.decode('utf8') would do the same thing at runtime.
However, this does not necessarily mean that you can print s now. First the internal unicode string has to be encoded in a character set that stdout (the console/editor/IDE) can actually display. Sadly, Python often fails to figure out the right character set and defaults to ASCII again, so you get an error when the string contains non-ASCII characters. The Python Wiki knows a few ways to set up stdout properly.
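The two steps described above, decoding bytes to unicode and then encoding for the output, can be sketched as follows. The byte literal spells out the UTF-8 encoding of 'abcdefö', and the ASCII target stands in for a console that cannot display ö:

```python
# -*- coding: utf-8 -*-

raw = b'abcdef\xc3\xb6'          # UTF-8 bytes for u'abcdefö'
text = raw.decode('utf-8')       # explicit decode, no ASCII default involved

byte_count = len(raw)            # the ö takes two bytes
char_count = len(text)           # but it is a single character

# Encoding for a limited console: replace what it cannot represent.
ascii_safe = text.encode('ascii', 'replace')
```

Here byte_count is 8 and char_count is 7, and ascii_safe ends in a question mark where the ö was.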
You need to mark the string as a unicode string:
s = u'abcdefö'
s = 'abcdefö'
DO NOT TRY unicode() if string is already in unicode. i.e. unicode(s) is wrong.
IF type(s) == str but contains unicode characters:
First convert to unicode
str_val = unicode(s, 'utf-8')
str_val = unicode(s, 'utf-8', 'replace')
Finally encode to string
str_val.encode('utf-8')
Now you can print:
print s

Parsing unicode input using python json.loads

What is the best way to load JSON Strings in Python?
I want to use json.loads to process unicode like this:
import json
json.loads(unicode_string_to_load)
I also tried supplying 'encoding' parameter with value 'utf-16', but the error did not go away.
Full SSCCE with error:
# -*- coding: utf-8 -*-
import json
value = '{"foo" : "bar"}'
print(json.loads(value)['foo']) #This is correct, prints 'bar'
some_unicode = unicode("degradé")
#last character is latin e with acute "\xc3\xa9"
value = '{"foo" : "' + some_unicode + '"}'
print(json.loads(value)['foo']) #incorrect, throws error
Error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
6: ordinal not in range(128)
Typecasting the string to a unicode string using 'latin-1' fixed the error:
UnicodeDecodeError: 'utf16' codec can't decode byte 0x38 in
position 6: truncated data
Fixed code:
import json
ustr_to_load = unicode(str_to_load, 'latin-1')
json.loads(ustr_to_load)
And then the error is not thrown.
The OP clarifies (in a comment!)...:
Source data is huge unicode encoded string
Then you have to know which of the many unicode encodings it uses -- clearly not 'utf-16', since that failed, but there are so many others -- 'utf-8', 'iso-8859-15', and so forth. You either try them all until one works, or print repr(str_to_load[:80]) and paste what it shows as an edit of your question, so we can guess on your behalf!-).
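The "try them all until one works" approach might look like the sketch below. The helper name decode_first_match and the candidate list are assumptions made for the example. Note that latin-1 accepts any byte sequence, so a successful decode with it does not prove the guess was right, and it belongs at the end of the list:

```python
# -*- coding: utf-8 -*-
import json


def decode_first_match(raw, encodings):
    """Return (text, encoding) for the first codec that decodes raw cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings fit')


# Sample input: UTF-8 bytes, standing in for the unknown source data.
raw = u'{"foo": "degradé"}'.encode('utf-8')

# latin-1 never fails, so it goes last as a fallback.
text, used = decode_first_match(raw, ['ascii', 'utf-8', 'latin-1'])
value = json.loads(text)['foo']
```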
The simplest way I have found is
import simplejson as json
that way your code remains the same
json.loads(str_to_load)
reference: https://simplejson.readthedocs.org/en/latest/
With django you can use SimpleJSON and use loads instead of just load.
from django.utils import simplejson
simplejson.loads(str_to_load, "utf-8")
