How to handle encoding in Python 2.7 and SQLAlchemy 🏴‍☠️ - python

I wrote some code in Python 3.5 that uses Tweepy and SQLAlchemy, with the following lines, to load tweets into a database, and it worked well:
twitter = Twitter(str(tweet.user.name).encode('utf8'), str(tweet.text).encode('utf8'))
session.add(twitter)
session.commit()
Running the same code in Python 2.7 now raises an error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in
position 139: ordinal not in range(128)
What's the solution? My MySQL configuration is as follows:
Server side --> utf8mb4 encoding
Client side --> create_engine('mysql+pymysql://abc:def#abc/def', encoding='utf8', convert_unicode=True)
UPDATE
It seems that there is no solution, at least not with Python 2.7 + SQLAlchemy. Here is what I found out so far and if I am wrong, please correct me.
Tweepy, at least in Python 2.7, returns unicode type objects.
In Python 2.7: tweet = u'☠' is a <'unicode' type>
In Python 3.5: tweet = u'☠' is a <'str' class>
This means Python 2.7 raises a UnicodeEncodeError if I call str(tweet), because it then tries to encode the character '☠' as ASCII, which is not possible: ASCII only covers 128 basic characters.
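The implicit conversion described above can be reproduced with an explicit encode. This is only a minimal sketch: the variable names are made up for illustration, and the explicit .encode('ascii') call stands in for what str() does implicitly on Python 2.7.

```python
skull = u'\u2620'  # the skull-and-crossbones character from the example above

# Roughly what Python 2.7's str(skull) attempts: an implicit ASCII encode.
try:
    skull.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly to UTF-8 succeeds and yields three bytes.
utf8_bytes = skull.encode('utf-8')

print(ascii_failed)
print(len(utf8_bytes))
```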
Conclusion:
Using just this statement tweet.user.name in the SQLAlchemy line gives me the following error:
UnicodeEncodeError: 'latin-1' codec can't encode characters in
position 0-4: ordinal not in range(256)
Using either tweet.user.name.encode('utf-8') or str(tweet.user.name.encode('utf-8')) in the SQLAlchemy line should work, but the database then shows garbled characters instead of the original text:
ð´ââ ï¸Jack Sparrow
This is what I want it to show:
Printed: 🏴‍☠️ Jack Sparrow
Special characters unicode: u'\U0001f3f4\u200d\u2620\ufe0f'
Special characters UTF-8 encoding: '\xf0\x9f\x8f\xb4\xe2\x80\x8d\xe2\x98\xa0\xef\xb8\x8f'
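One plausible explanation for the garbled name, which the question does not confirm, is that the UTF-8 bytes were interpreted as Latin-1 somewhere between the client and the server. The sketch below reproduces that kind of damage and shows that it is reversible; the variable names are illustrative only.

```python
flag = u'\U0001f3f4\u200d\u2620\ufe0f'  # the pirate-flag sequence from the question

utf8_bytes = flag.encode('utf-8')

# Misreading those UTF-8 bytes as Latin-1 yields one junk character per byte.
mojibake = utf8_bytes.decode('latin-1')

# The damage is reversible as long as no byte was dropped along the way.
repaired = mojibake.encode('latin-1').decode('utf-8')

print(len(mojibake))
print(repaired == flag)
```

If the stored value round-trips like this, the data itself is intact and only the connection or column character set is misconfigured.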

Do not use any encode/decode functions; they only compound the problems.
Do set the connection to be UTF-8.
Do set the column/table to utf8mb4 instead of utf8.
Do use # -*- coding: utf-8 -*- at the beginning of Python code.
See "More Python tips". Note that that page links to "Python 2.7 issues; improvements in Python 3".
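A sketch of what "set the connection to be UTF-8" could look like with PyMySQL: the charset=utf8mb4 query parameter asks the driver to use utf8mb4 on the connection. The helper build_mysql_url and all credentials below are hypothetical placeholders, not values from the question.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2.7


def build_mysql_url(user, password, host, database):
    """Return a SQLAlchemy URL for PyMySQL that requests utf8mb4 on the connection."""
    return 'mysql+pymysql://{0}:{1}@{2}/{3}?charset=utf8mb4'.format(
        quote(user, safe=''), quote(password, safe=''), host, database)


# Placeholder credentials for illustration only.
url = build_mysql_url('app_user', 'p@ss', 'localhost', 'tweets')
print(url)

# With SQLAlchemy installed, the engine would then be created like this:
# from sqlalchemy import create_engine
# engine = create_engine(url)
```

The unicode values from Tweepy can then be passed to the model unchanged, without any encode() call.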

Related

'ascii' codec can't encode character when redirecting Python script to file through Bash [duplicate]

I have a python script that grabs a bunch of recent tweets from the twitter API and dumps them to screen. It works well, but when I try to direct the output to a file something strange happens and a print statement causes an exception:
> ./tweets.py > tweets.txt
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2018' in position 61: ordinal not in range(128)
I understand that the problem is with a UTF-8 character in one of the tweets that doesn't translate well to ASCII, but what is a simple way to dump the output to a file? Do I fix this in the python script or is there a way to coerce it at the commandline?
BTW, the script was written in Python2.
Without modifying the script, you can just set the environment variable PYTHONIOENCODING=utf8 and Python will assume that encoding when redirecting to a file.
References:
https://docs.python.org/2.7/using/cmdline.html#envvar-PYTHONIOENCODING
https://docs.python.org/3.3/using/cmdline.html#envvar-PYTHONIOENCODING
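The effect of the environment variable can be checked from Python itself. In this sketch a child interpreter stands in for ./tweets.py, and capturing its output through a pipe plays the role of the shell redirect; the child code is invented for the demonstration.

```python
import os
import subprocess
import sys

# Copy the current environment and force UTF-8 for the child's standard streams.
env = dict(os.environ, PYTHONIOENCODING='utf8')

# Stand-in for ./tweets.py: a child interpreter that prints one curly quote.
child_code = "print(u'\\u2018')"

# Capturing stdout through a pipe mimics redirecting the output to a file.
output = subprocess.check_output([sys.executable, '-c', child_code], env=env)
print(repr(output))
```

On the command line, the equivalent is to prefix the command: PYTHONIOENCODING=utf8 ./tweets.py > tweets.txt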
You may need to encode the unicode object with .encode('utf-8').
In your Python file, add this as the first line:
# -*- coding: utf-8 -*-
If your script runs standalone and starts with a shebang line, put it on the second line:
#!/usr/local/bin/python
# -*- coding: utf-8 -*-
Here is the document: PEP 0263

robot framework: difference in encoding between old formatting %s and new formatting

I have a library with a keyword. The keyword writes a message to the test documentation.
The Python file is in UTF-8 and has the required header:
# -*- coding: utf-8 -*-
The *.robot files are in UTF-8 as well.
Running this keyword from a robot file with non-ASCII symbols gives:
If the keyword uses "%s" % msg: no error, and the log file displays the Russian message correctly.
If the keyword uses "{}".format(msg) or "{!s}".format(msg): I get the error UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-10: ordinal not in range(128)
As you can see, the only change is from old-style to new-style Python formatting. How can I fix this non-ASCII error with the new style, without going back to old-style formatting?
Try the str.decode(encoding='UTF-8', errors='strict') method. See the Python docs.
Example:
"{}".format(msg.decode(errors='replace'))
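An alternative to decoding, assuming msg is already a unicode object (the question does not say): on Python 2.7 a byte-string template such as "{}" makes format() convert a unicode argument with the ASCII codec, while "%s" % msg promotes the result to unicode. Making the template itself unicode avoids the implicit conversion. The sample message below is invented for illustration.

```python
# A stand-in for the keyword's Russian message ("privet" in Cyrillic).
msg = u'\u043f\u0440\u0438\u0432\u0435\u0442'

# A unicode template keeps the whole operation in unicode on Python 2.7;
# on Python 3 every str literal is already unicode.
new_style = u'{}'.format(msg)
old_style = u'%s' % msg

print(new_style == old_style)
```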

How to split string containing non-ascii characters in views?

In Python 2.7 with Django 1.8, I have this function in the postman views:
def mod1(message):
    print 'message is', message  # bob>mary:سلام
    message = str(message)  # without this I get 'Message' object has no attribute 'split'
    sndr = message.split('>')[0]
    print 'snder', sndr
    #...
Which gives this error:
'ascii' codec can't decode byte 0xd8 in position 15: ordinal not in range(128)
Strangely, I can do the split in Python terminal.
I have also added # -*- coding: utf-8 -*- at top of the views.
Appreciate your hints to solve this.
Message contains some unicode text.
If you are not careful to print it while encoding it properly, then you will get those errors.
They occur because Python by default tries to encode using only the ASCII codec, which cannot handle the Arabic characters. Usually you want to tell it to encode using UTF-8 or a similarly capable codec instead.
str(something).encode('UTF-8')
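Another way to sidestep the codec error is to keep the text as unicode for the whole operation and split that. This is a sketch under the assumption that the message text is already available as a unicode string; the sample value mirrors the comment in the question, written with escapes.

```python
# "bob>mary:" followed by the Arabic word from the question, as a unicode string.
text = u'bob>mary:\u0633\u0644\u0627\u0645'

# partition() splits on the first separator only and never raises if it is missing.
sender, _, rest = text.partition(u'>')
recipient, _, body = rest.partition(u':')

print(sender)
print(recipient)
```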

UnicodeDecodeError error writing .xlsx file using xlsxwriter

I am trying to write about 1000 rows to a .xlsx file from my python application. The data is basically a combination of integers and strings. I am getting intermittent error while running wbook.close() command. The error is the following:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15:
ordinal not in range(128)
My data does not have anything in unicode. I am wondering why the decoder is being used at all. Has anyone noticed this problem?
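0xc3 is the first byte of a two-byte UTF-8 sequence (for example, "À" is encoded as 0xc3 0x80), so your data does contain non-ASCII text. What you need to do is decode the byte string with the decode() method: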
string.decode('utf-8')
Also depending on your needs and uses you could add
# -*- coding: utf-8 -*-
at the beginning of your script, but only if you are sure that the encoding will not interfere and break something else.
As Alex Hristov points out you have some non-ascii data in your code that needs to be encoded as UTF-8 for Excel.
See the following examples from the docs which each have instructions on handling UTF-8 with XlsxWriter in different scenarios:
Example: Simple Unicode with Python 2
Example: Simple Unicode with Python 3
Example: Unicode - Polish in UTF-8
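A sketch of decoding every byte string in a row before it is handed to the worksheet. The helper to_text is hypothetical and not part of XlsxWriter, and it assumes the byte strings are UTF-8 encoded.

```python
def to_text(value, encoding='utf-8'):
    """Decode byte strings to text; return every other value unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# A sample row mixing an integer, UTF-8 bytes and text.
row = [42, b'caf\xc3\xa9', u'plain']
clean_row = [to_text(cell) for cell in row]

print(clean_row)
```

Because integers and text pass through untouched, the helper can be applied to every cell without checking types at the call site.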

Why does my Python program get UnicodeDecodeError in IntelliJ but is OK from the command line?

I have a simple program that loads a .json file which contains a funny character. The program (see below) runs fine in Terminal but gets this error in IntelliJ:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
2: ordinal not in range(128)
The crucial code is:
with open(jsonFileName) as f:
jsonData = json.load(f)
If I replace the open with:
with open(jsonFileName, encoding='utf-8') as f:
Then it works in both IntelliJ and Terminal. I'm still new to Python and the IntelliJ plugin, and I don't understand why they're different. I thought sys.path might be different, but the output makes me think that's not the cause. Could someone please explain? Thanks!
Versions:
OS: Mac OS X 10.7.4 (also tested on 10.6.8)
Python 3.2.3 (v3.2.3:3d0686d90f55, Apr 10 2012, 11:25:50) /Library/Frameworks/Python.framework/Versions/3.2/bin/python3.2
IntelliJ: 11.1.3 Ultimate
Files (2):
1. unicode-error-demo.py
#!/usr/bin/python
import json
from pprint import pprint as pp
import sys
def main():
    # Compare by value; "is not 2" tests object identity and is unreliable.
    if len(sys.argv) != 2:
        print(sys.argv[0], "takes one arg: a .json file")
        return
    jsonFileName = sys.argv[1]
    print("sys.path:")
    pp(sys.path)
    print("processing", jsonFileName)
    # with open(jsonFileName) as f: # OK in Terminal, but BUG in IntelliJ: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)
    with open(jsonFileName, encoding='utf-8') as f: # OK in both
        jsonData = json.load(f)
    pp(jsonData)
if __name__ == "__main__":
    main()
2. encode-temp.json
["™"]
The JSON .load() function expects Unicode data, not raw bytes. Python automatically tries to decode the byte string to a Unicode string for you using a default codec (in your case ASCII), and fails. By opening the file with the UTF-8 codec, Python makes an explicit conversion for you. See the open() function, which states:
In text mode, if encoding is not specified the encoding used is platform dependent.
The encoding that would be used is determined as follows:
Try os.device_encoding() to see if there is a terminal encoding.
Use locale.getpreferredencoding() function, which depends on the environment you run your code in. The do_setlocale of that function is set to False.
Use 'ASCII' as a default if both methods have returned None.
This is all done in C, but its Python equivalent would be:
if encoding is None:
    encoding = os.device_encoding()
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
        if encoding is None:
            encoding = 'ASCII'
So when you run your program in a terminal, os.device_encoding() returns 'UTF-8', but when running under IntelliJ there is no terminal, and if no locale is set either, Python uses 'ASCII'.
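To see which fallback applies in a given environment, and to avoid depending on it, something like the following can be run both in the terminal and in the IDE. It is a minimal sketch: the temporary file stands in for encode-temp.json.

```python
import io
import json
import locale
import os
import tempfile

# The fallback that open() uses when no encoding is given; it varies by environment.
print(locale.getpreferredencoding(False))

# Write the sample JSON as UTF-8, then read it back with an explicit encoding.
fd, path = tempfile.mkstemp(suffix='.json')
os.close(fd)
try:
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(u'["\u2122"]')
    with io.open(path, encoding='utf-8') as f:
        json_data = json.load(f)
finally:
    os.remove(path)

print(json_data == [u'\u2122'])
```

Passing encoding explicitly makes the result independent of whatever the first print reports.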
The Python Unicode HOWTO tells you all about the difference between unicode strings and bytestrings, as well as encodings. Another essential article on the subject is Joel Spolsky's Absolute Minimum Unicode knowledge article.
Python 2.x has byte strings (str) and unicode strings. Byte strings are decoded as ASCII by default whenever they are converted implicitly. ASCII uses only 7 bits per character, which allows 128 characters to be encoded, while UTF-8 uses up to 4 bytes per character. UTF-8 is compatible with ASCII (any ASCII-encoded string is also a valid UTF-8 string), but not the other way round.
Apparently, your file contains non-ASCII characters. By default, Python wants to read it as a plain ASCII-encoded string, spots a non-ASCII byte (0xe2, whose first bit is not 0) and reports: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128).
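The position reported in the error can be reproduced directly from the bytes of the sample file. A small sketch, with variable names chosen for illustration:

```python
# The UTF-8 bytes of encode-temp.json: [ " (trademark sign) " ]
raw = u'["\u2122"]'.encode('utf-8')

try:
    raw.decode('ascii')
    failure = None
except UnicodeDecodeError as exc:
    failure = exc

# exc.start is the offset of the first byte the ASCII codec could not decode.
position = failure.start
first_bad_byte = raw[position:position + 1]

print(position)
print(repr(first_bad_byte))
```

The two ASCII bytes for [ and " occupy offsets 0 and 1, so the first byte of the trademark sign sits at offset 2.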
This has nothing to do with Python specifically, but it is still my favourite tutorial about encodings:
http://hektor.umcs.lublin.pl/~mikosmul/computing/articles/linux-unicode.html
