Python with MySql unicode problems

I need to call a MySQL stored procedure from my Python script. One of the parameters I pass is a Unicode string (in Russian), but I get an error:
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256)
My script:
self.db=MySQLdb.connect("localhost", "usr", "pass", "dbName")
self.cursor=self.db.cursor()
args=("какой-то текст") #this is string in russian
self.cursor.callproc('pr_MyProc', args)
self.cursor.execute('SELECT @_pr_MyProc_2') #getting result from sp
result=self.cursor.fetchone()
self.db.commit()
I've read that setting charset='utf8' should resolve this problem, but when I connect with:
self.db=MySQLdb.connect("localhost", "usr", "pass", "dbName", charset='utf8')
I get another error:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd1' in position 20: surrogates not allowed
I've also tried setting the parameter use_unicode=True, but that doesn't work either.

More things to check on:
http://mysql.rjweb.org/doc.php/charcoll#python
Likely items:
Start code file with # -*- coding: utf-8 -*- -- (for literals in code)
Literals should be u'...'
Can you extract the HEX? какой-то текст should be this in utf8: D0BA D0B0 D0BA D0BE D0B9 2D D182 D0BE 20 D182 D0B5 D0BA D181 D182
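One way to extract the hex on the Python side, as a minimal sketch in Python 3 (the helper name utf8_hex is made up for illustration): encode each character to UTF-8 and print its bytes, grouped per character, so the result can be compared with what MySQL's HEX() returns for the stored value.

```python
def utf8_hex(text):
    """Return the UTF-8 bytes of text as hex, one group per character."""
    return " ".join(ch.encode("utf-8").hex().upper() for ch in text)


sample = "какой-то текст"
sample_hex = utf8_hex(sample)
print(sample_hex)
```

If the hex stored in the table differs from this output, the text was probably mangled on its way into the database rather than on the way out.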

Here are some thoughts, maybe not a full answer. I've played with Python/MySQL/UTF-8/Unicode in the past, and these are the things I remember:
Looking at the comment in Saltstack's mysql module:
https://github.com/saltstack/salt/blob/develop/salt/modules/mysql.py#L314-L322
# MySQLdb states that this is required for charset usage
# but in fact it's more than it's internally activated
# when charset is used, activating use_unicode here would
# retrieve utf8 strings as unicode() objects in salt
# and we do not want that.
#_connarg('connection_use_unicode', 'use_unicode')
connargs['use_unicode'] = False
_connarg('connection_charset', 'charset')
We see that to avoid altering the result string the use_unicode is set to False, while the charset (which could be utf-8) is set as a parameter. use_unicode is more a 'request' to get responses as unicode strings.
You can check real usage in the tests, here:
https://github.com/saltstack/salt/blob/develop/tests/integration/modules/test_mysql.py#L311-L361 with a database named '標準語'.
Now about the message UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd1'. You are using unicode, but you tell the module it is utf-8. It is not utf-8 until you encode your unicode string in utf-8.
Maybe you should try with:
args=(u"какой-то текст".encode('utf-8'),)  # trailing comma makes this a one-element tuple
At least in python3 this is required, because your "какой-то текст" is not in utf-8 by default.
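As a side note on where '\udcd1' can come from: in Python 3, a code point in the range U+DC80 to U+DCFF inside a str is usually a raw byte that was smuggled in by the surrogateescape error handler (used, for example, when command-line arguments or file names do not match the locale). The sketch below reproduces that situation with made-up data; it shows one plausible cause and is not a diagnosis of the script in the question.

```python
raw = "текст".encode("utf-8")  # the first byte is 0xD1

# Decoding with the wrong codec plus surrogateescape keeps every
# non-ASCII byte 0xNN as the lone surrogate U+DCNN.
smuggled = raw.decode("ascii", errors="surrogateescape")
first_char = smuggled[0]

try:
    smuggled.encode("utf-8")
    failure = None
except UnicodeEncodeError as exc:
    failure = exc.reason  # the "surrogates not allowed" part of the message

# Undoing the wrong decode and decoding properly recovers the text.
recovered = smuggled.encode("ascii", errors="surrogateescape").decode("utf-8")
```

If this is what happened, the fix belongs at the point where the string enters the program, not in the database connection settings.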

The original MySQLdb package (MySQL-python) is not compatible with Python 3. That might be why you are having problems. I would advise using a different connector, like PyMySQL or mysqlclient.
Related: 23376103.

Maybe you can reload sys with UTF-8 as the default encoding and try to decode the string as UTF-8, as follows (Python 2 only):
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
...
stringUtf8 = u''.join(string_original).decode('utf-8')

I had a similar problem very recently but with PostgreSQL. After trying tons of suggestions from SO/ internet, I realized the issue was with my database. I had to drop my database and reinstall Postgres, because for some reason it was not allowing me to change the database's default collation. I was in a hurry so couldn't find a better solution, but would recommend the same, since I was only starting my application in the deployment environment.
All the best.

What's your database's charset?
Use:
show variables like "character%";
or check your database's charset.

I see here two problems.
You have unicode, but you declare it as utf-8 by setting the "charset" parameter. You should first encode your unicode to utf-8 or another encoding.
If that still doesn't work, try the init_command='SET NAMES UTF8' parameter.
So it will look like:
conn = MySQLdb.connect(charset='utf8', init_command='SET NAMES UTF8')
You can try also this:
cursor = db.cursor()
cursor.execute("SET NAMES UTF8;")

I encountered a similar issue, which was caused by invalid utf-8 data in the database; it seems that MySQL doesn't care about that, but Python does, because it's following the UTF-8 spec, which says that:
surrogate pairs are not allowed in utf-8
unpaired surrogates are not allowed in utf-8
If you want to "make it work", you'll have to intercept the MySQL packet and use your own converter which will perform ad-hoc replacements.
Here's one way to "handle" invalid data containing surrogates:
def borked_utf8_decode(data):
    """
    Work around input with unpaired surrogates or surrogate pairs,
    replacing by XML char refs: look for "&#\d+;" after.
    """
    return data.decode("utf-8", "surrogatepass") \
               .encode("utf-8", "xmlcharrefreplace") \
               .decode("utf-8")
Note that the proper way to handle that is context-dependent, but there are some common replacement scenarios, like this one.
And here's one way of plugging this into pymysql (another way is to monkey-patch field processing, see eg. https://github.com/PyMySQL/PyMySQL/issues/631):
import pymysql.converters
# use this in your connection
pymysql_use_unicode = False
conversions = pymysql.converters.conversions
conversions[pymysql.converters.FIELD_TYPE.STRING] = borked_utf8_decode
conversions[pymysql.converters.FIELD_TYPE.VAR_STRING] = borked_utf8_decode
conversions[pymysql.converters.FIELD_TYPE.VARCHAR] = borked_utf8_decode
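To make the effect of borked_utf8_decode concrete, here is a small self-contained check in Python 3. It repeats the function from above and feeds it a hand-built byte string in which b"\xed\xb3\x91" is the (invalid) UTF-8-style encoding of the lone surrogate U+DCD1. The sample bytes are an illustration, not data from a real database.

```python
def borked_utf8_decode(data):
    """Decode bytes as UTF-8, turning encoded surrogates into XML char refs."""
    return (
        data.decode("utf-8", "surrogatepass")
        .encode("utf-8", "xmlcharrefreplace")
        .decode("utf-8")
    )


# Valid UTF-8 passes through unchanged.
clean = borked_utf8_decode("текст".encode("utf-8"))

# An encoded lone surrogate becomes a numeric character reference.
patched = borked_utf8_decode(b"bad: \xed\xb3\x91")
print(clean, patched)
```

The replacement is lossy in the sense that the original bytes are no longer recoverable from the text alone, which is why the right handling depends on context.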

Related

How to enable decoding for specific Python class?

I have following code:
return render_template(
'sample.html',
title=('Härre').decode('utf-8'),
year=datetime.now().year,
message = ("Härre guut").decode('utf-8')
)
Above code is working fine. But I want to know is it possible to enable automatic decoding for special characters for specific class? So that my code becomes like this:
return render_template(
'sample.html',
title=('Härre'),
year=datetime.now().year,
message = ("Härre guut")
)
If yes then how it is done?
In my case special character is letter ä, and it can be decoded with utf-8.
I have tried to to add following line at first line of my code:
# -*- coding: utf-8 -*-
However that won't help. I get following error if I try to enable above line and take out all decode('utf-8') parts:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 12: ordinal not in range(128)
Which is very clear error. It tries to use 'ascii' codec to decode special characters of the code.
If you wonder why I want to enable such functionality, the answer is that in future I'm going to have more render_template() method. This method is used with Flask framework.

If you're only talking about strings that you type into the source yourself, I'd try putting a u in front of them (e.g. u'string'). The u tells Python the string is unicode. I've used that successfully in similar circumstances in the past.
Also, according to the flask docs (way at the bottom), to use that # -*- coding: ... bit properly, you have to set your text editor to UTF-8, otherwise it's saving in ascii, which might be why you're still getting that error.

At the risk of being too trivial, you could make this decoding automatic by using Python 3, which uses Unicode instead of ascii.
Otherwise, as Brandon indicated, you do have to take pains with Python 2 to appropriately decode your inputs, use Unicode within your program (prefacing strings with 'u'), and then encode when you output data, usually to utf-8, but whatever is appropriate for you.
Decoding and encoding at "the edges" and using Unicode in between is a lot of work, but necessary. It's a good reason to make the switch to Python 3. :-)
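The "decode at the edges" pattern can be sketched as follows in Python 3. The function names and the sample bytes are invented for illustration: bytes are decoded once on the way in, the program works on str, and text is encoded once on the way out.

```python
def read_edge(raw, encoding="utf-8"):
    """Input edge: turn bytes from a file, socket or database into str."""
    return raw.decode(encoding)


def write_edge(text, encoding="utf-8"):
    """Output edge: turn str back into bytes just before it leaves the program."""
    return text.encode(encoding)


incoming = b"H\xc3\xa4rre guut"   # UTF-8 bytes as they might arrive
message = read_edge(incoming)     # str from here on
shouted = message.upper()         # text operations work on characters
outgoing = write_edge(shouted)
```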

Simply define your strings as Unicode strings using the u prefix:
return render_template(
'sample.html',
title=(u'Härre'),
year=datetime.now().year,
message = (u"Härre guut")
)
As you've baked non-ASCII into your source code, set the 'coding' bit at the top of your source code. Ensure your editor's character encoding matches the coding header.
# -*- coding: utf-8 -*-

How to encode (utf8mb4) in Python

How do I encode something in utf8mb4 in Python?
I have two sets of data: data I am migrating to my new MySQL database over from Parse, and data going forward (that talks only to my new database). My database is utf8mb4 in order to store emoji and accented letters.
The first set of data only shows up correctly (when emoji and accents are involved) when I have in my python script:
MySQLdb.escape_string(unicode(xstr(data.get('message'))).encode('utf-8'))
and when reading from the MySQL database in PHP:
$row["message"] = utf8_encode($row["message"]);
The second set of data only shows up correctly (when emoji and accents are involved) when I DON'T include the utf8_encode($row["message"]) portion. I am trying to reconcile these so that both sets of data are returned correctly to my iOS app. Please help!

I have struggled myself with the correct exchange of the full range of UTF-8 characters between Python and MySQL for the sake of Emoji and other characters beyond the U+FFFF codepoint.
To be sure that everything worked fine, I had to do the following:
make sure utf8mb4 was used for CHAR, VARCHAR, and TEXT columns in MySQL
enforce UTF-8 in Python
enforce UTF-8 to be used between Python and MySQL
To enforce UTF-8 in Python, add the following line as first or second line of your Python script:
# -*- coding: utf-8 -*-
To enforce UTF-8 between Python and MySQL, setup the MySQL connection as follows:
# Connect to mysql.
dbc = MySQLdb.connect(host='###', user='###', passwd='###', db='###', use_unicode=True)
# Create a cursor.
cursor = dbc.cursor()
# Enforce UTF-8 for the connection.
cursor.execute('SET NAMES utf8mb4')
cursor.execute("SET CHARACTER SET utf8mb4")
cursor.execute("SET character_set_connection=utf8mb4")
# Do database stuff.
# Commit data.
dbc.commit()
# Close cursor and connection.
cursor.close()
dbc.close()
This way, you don't need to use functions such as encode and utf8_encode.

MySQL's utf8mb4 encoding is just standard UTF-8.
They had to add that name however to distinguish it from the broken UTF-8 character set which only supported BMP characters.
In other words, from the Python side you should always encode to UTF-8 when talking to MySQL, but take into account that the database may not be able to handle Unicode codepoints beyond U+FFFF, unless you use utf8mb4 on the MySQL side.
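A quick way to see the difference, sketched in Python 3 (the helper name needs_utf8mb4 is hypothetical): characters up to U+FFFF take at most three bytes in UTF-8, while anything beyond takes four, which is what MySQL's older three-byte utf8 character set cannot store.

```python
def needs_utf8mb4(text):
    """True if text contains a code point outside the BMP (above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


accented = "\u00e9"       # e with acute accent, inside the BMP
emoji = "\U0001F600"      # grinning face, outside the BMP

accented_len = len(accented.encode("utf-8"))
emoji_len = len(emoji.encode("utf-8"))
print(needs_utf8mb4(accented), accented_len)
print(needs_utf8mb4(emoji), emoji_len)
```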
However, generally speaking, you want to avoid manually encoding and decoding, and instead leave it to MySQLdb to worry about this. You do this by configuring your connection and your collations to handle Unicode text transparently. For MySQLdb, that means setting charset='utf8mb4':
database = MySQLdb.connect(
host=hostname,
user=username,
passwd=password,
db=databasename,
charset="utf8mb4"
)
Then use normal Python 3 str strings; leave the use_unicode option set to its default True*.
Note: this handles SET NAMES (and SET character_set_connection) for you; there is no need to issue those manually.
* Unless you still use Python 2, then the default is False. Set it to True and use u'...' unicode strings.

use_unicode=True didn't work for me.
My solution
in mysql, change entire database, table and field encoding to utf8mb4
MySQLdb.connect(host='###' [...], charset='utf8')
dbCursor.execute('SET NAMES utf8mb4')
dbCursor.execute("SET CHARACTER SET utf8mb4")

You can also specify the encoding you want when connecting, as follows:
mysql.connector.connect(host = '<host>', database = '<db>', user = '<user>', password = '<password>', charset = 'utf8')
The fields inside '<>' are your own details. Instead of 'utf8' you can also write 'utf8mb4', depending on the character set your MySQL database uses.

string decode method error in python

I have a function like this:
def convert_to_unicode(data):
    row = {}
    if data == None:
        return data
    try:
        for key, val in data.items():
            if isinstance(val, str):
                row[key] = unicode(val.decode('utf8'))
            else:
                row[key] = val
        return row
    except Exception, ex:
        log.debug(ex)
to which I feed a result set (got using MySQLdb.cursors.DictCursor) row by row to transform all the string values to unicode (example {'column_1':'XXX'} becomes {'column_1':u'XXX'}).
Problem is one of the rows has a value like {'column_1':'Gabriel García Márquez'}
and it does not get transformed. It throws this error:
'utf8' codec can't decode byte 0xed in position 12: invalid continuation byte
When I searched for this it seems that this has to do with ascii encoding.
The solutions i tried are:
adding # -*- coding: utf-8 -*- at the beginning of my file ... does not help
changing the line row[key] = unicode(val.decode('utf8')) to row[key] = unicode(val.decode('utf8', 'ignore')) ... as expected it ignores the non-ascii character and returns {'column_1':u'Gabriel Garca Mrquez'}
changing the line row[key] = unicode(val.decode('utf8')) to row[key] = unicode(val.decode('latin-1')) ... Does the job but I am afraid it will support only West Europe characters (as per Here )
Can anybody point me in the right direction, please?

Firstly:
The data you're getting in your result set is clearly latin-1 encoded, or you wouldn't be observing this behavior. It is entirely correct that trying to decode a latin-1-encoded byte string as though it were utf-8-encoded blows up in your face. Once you have a latin-1-encoded byte string foo, if you want to convert it to the unicode type, foo.decode('latin1') is the right thing to do.
I noticed the expression unicode(val.decode('utf8')) in your code. This is equivalent to just val.decode('utf8'); calling the .decode method of a byte string converts it to unicode, so you're calling unicode() on a unicode string, which just returns the unicode string.
Secondly:
Your real problem here - if you want to be able to deal with characters not included in the character set supported by the latin-1 encoding - is not with Python's string types, per se, so much as it is with the MySQLdb library. I don't know this problem in intimate detail, but as I understand it, in ancient versions of MySQL, the default encoding used by MySQL databases was latin-1, but now it is utf-8 (and has been for many years). The MySQLdb library, however, still by default establishes latin-1-encoded connections with the database. There are literally dozens of StackOverflow questions relating to MySQL, Python, and string encoding, and while I don't fully understand them, one easy-to-use solution to all such problems that seems to work for people is this one:
http://www.dasprids.de/blog/2007/12/17/python-mysqldb-and-utf-8
I wish I could give you a more comprehensive and confident answer on the MySQLdb issue, but I've never even used MySQL and I don't want to risk posting anything untrue. Perhaps someone can come along and provide more detail. Nonetheless, I hope this helps you.

Your third solution - changing the encoding to "latin-1" - is correct. Your input data is encoded as Latin-1, so that's what you have to decode it as. Unless someone somewhere did something very silly, it should be impossible for that input data to contain invalid characters for that encoding.
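The failure from the question can be reproduced without a database. The sketch below (Python 3; decode_with_fallback is a made-up helper) encodes the name as Latin-1, shows that decoding it as UTF-8 fails at byte 12, and then falls back to Latin-1. The fallback is an assumption about the data: Latin-1 can decode any byte sequence, so it will silently produce wrong characters if the bytes are actually in some other encoding.

```python
def decode_with_fallback(raw, primary="utf-8", fallback="latin-1"):
    """Try the primary codec first; fall back only if the bytes are invalid."""
    try:
        return raw.decode(primary)
    except UnicodeDecodeError:
        return raw.decode(fallback)


latin1_bytes = "Gabriel Garc\u00eda M\u00e1rquez".encode("latin-1")

try:
    latin1_bytes.decode("utf-8")
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start  # index of the offending byte (0xed)

name = decode_with_fallback(latin1_bytes)
```

The longer-term fix remains making the connection deliver a known encoding, so that no guessing is needed.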

UnicodeEncodeError: 'latin-1' codec can't encode character

What could be causing this error when I try to insert a foreign character into the database?
>>UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201c' in position 0: ordinal not in range(256)
And how do I resolve it?
Thanks!

I ran into this same issue when using the Python MySQLdb module. Since MySQL will let you store just about any binary data you want in a text field regardless of character set, I found my solution here:
Using UTF8 with Python MySQLdb
Edit: Quote from the above URL to satisfy the request in the first comment...
"UnicodeEncodeError:'latin-1' codec can't encode character ..."
This is because MySQLdb normally tries to encode everything to latin-1.
This can be fixed by executing the following commands right after
you've established the connection:
db.set_character_set('utf8')
dbc.execute('SET NAMES utf8;')
dbc.execute('SET CHARACTER SET utf8;')
dbc.execute('SET character_set_connection=utf8;')
"db" is the result of MySQLdb.connect(), and "dbc" is the result of
db.cursor().

Character U+201C Left Double Quotation Mark is not present in the Latin-1 (ISO-8859-1) encoding.
It is present in code page 1252 (Western European). This is a Windows-specific encoding that is based on ISO-8859-1 but which puts extra characters into the range 0x80-0x9F. Code page 1252 is often confused with ISO-8859-1, and it's an annoying but now-standard web browser behaviour that if you serve your pages as ISO-8859-1, the browser will treat them as cp1252 instead. However, they really are two distinct encodings:
>>> u'He said \u201CHello\u201D'.encode('iso-8859-1')
UnicodeEncodeError
>>> u'He said \u201CHello\u201D'.encode('cp1252')
'He said \x93Hello\x94'
If you are using your database only as a byte store, you can use cp1252 to encode “ and other characters present in the Windows Western code page. But still other Unicode characters which are not present in cp1252 will cause errors.
You can use encode(..., 'ignore') to suppress the errors by getting rid of the characters, but really in this century you should be using UTF-8 in both your database and your pages. This encoding allows any character to be used. You should also ideally tell MySQL you are using UTF-8 strings (by setting the database connection and the collation on string columns), so it can get case-insensitive comparison and sorting right.

The best solution is:
set MySQL's charset to 'utf-8'
do as this comment suggests (add use_unicode=True and charset="utf8"):
db = MySQLdb.connect(host="localhost", user = "root", passwd = "", db = "testdb", use_unicode=True, charset="utf8") – KyungHoon Kim Mar 13 '14 at 17:04
For details, see:
class Connection(_mysql.connection):
"""MySQL Database Connection Object"""
default_cursor = cursors.Cursor
def __init__(self, *args, **kwargs):
"""
Create a connection to the database. It is strongly recommended
that you only use keyword parameters. Consult the MySQL C API
documentation for more information.
host
string, host to connect
user
string, user to connect as
passwd
string, password to use
db
string, database to use
port
integer, TCP/IP port to connect to
unix_socket
string, location of unix_socket to use
conv
conversion dictionary, see MySQLdb.converters
connect_timeout
number of seconds to wait before the connection attempt
fails.
compress
if set, compression is enabled
named_pipe
if set, a named pipe is used to connect (Windows only)
init_command
command which is run once the connection is created
read_default_file
file from which default client values are read
read_default_group
configuration group to use from the default file
cursorclass
class object, used to create cursors (keyword only)
use_unicode
If True, text-like columns are returned as unicode objects
using the connection's character set. Otherwise, text-like
columns are returned as strings. columns are returned as
normal strings. Unicode objects will always be encoded to
the connection's character set regardless of this setting.
charset
If supplied, the connection character set will be changed
to this character set (MySQL-4.1 and newer). This implies
use_unicode=True.
sql_mode
If supplied, the session SQL mode will be changed to this
setting (MySQL-4.1 and newer). For more details and legal
values, see the MySQL documentation.
client_flag
integer, flags to use or 0
(see MySQL docs or constants/CLIENTS.py)
ssl
dictionary or mapping, contains SSL connection parameters;
see the MySQL documentation for more details
(mysql_ssl_set()). If this is set, and the client does not
support SSL, NotSupportedError will be raised.
local_infile
integer, non-zero enables LOAD LOCAL INFILE; zero disables
autocommit
If False (default), autocommit is disabled.
If True, autocommit is enabled.
If None, autocommit isn't set and server default is used.
There are a number of undocumented, non-standard methods. See the
documentation for the MySQL C API for some hints on what they do.
"""

I hope your database is at least UTF-8. Then you will need to run yourstring.encode('utf-8') before you try putting it into the database.

Use the snippet below to strip accents from Latin letters, leaving their plain ASCII base letters:
import unicodedata
def strip_accents(text):
    return "".join(char for char in
                   unicodedata.normalize('NFKD', text)
                   if unicodedata.category(char) != 'Mn')
strip_accents('áéíñóúü')
output:
'aeinouu'

You are trying to store a Unicode codepoint \u201c using an encoding ISO-8859-1 / Latin-1 that can't describe that codepoint. Either you might need to alter the database to use utf-8, and store the string data using an appropriate encoding, or you might want to sanitise your inputs prior to storing the content; i.e. using something like Sam Ruby's excellent i18n guide. That talks about the issues that windows-1252 can cause, and suggests how to process it, plus links to sample code!

SQLAlchemy users can simply specify their field as convert_unicode=True.
Example:
sqlalchemy.String(1000, convert_unicode=True)
SQLAlchemy will simply accept unicode objects and return them back, handling the encoding itself.
Docs

Latin-1 (aka ISO 8859-1) is a single octet character encoding scheme, and you can't fit \u201c (“) into a byte.
Did you mean to use UTF-8 encoding?

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 106: ordinal not in range(256)
Solution 1:
\u2013 - look up the code point to identify which character is actually causing the error. Then you can replace that specific character in the string with another character that is part of the encoding you are using.
Solution 2:
Change the encoding to one that includes all the characters in your string. Then you can print the string and it will work just fine.
The code below changes the encoding of the string; it is borrowed from @bobince:
u'He said \u201CHello\u201D'.encode('cp1252')
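For Solution 1, the offending characters can also be found programmatically instead of by searching. A minimal sketch in Python 3, where unencodable is a hypothetical helper name and the sample sentence is made up:

```python
import unicodedata


def unencodable(text, encoding):
    """Return (character, Unicode name) pairs that encoding cannot represent."""
    problems = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            problems.append((ch, unicodedata.name(ch, "UNKNOWN")))
    return problems


sample = "He said \u201cHello\u201d \u2013 twice"
latin1_problems = unencodable(sample, "latin-1")
cp1252_problems = unencodable(sample, "cp1252")
print(latin1_problems)
print(cp1252_problems)
```

Running the same text against several candidate encodings this way shows quickly whether switching the encoding (Solution 2) would actually cover every character.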

The latest version of mysql.connector has only
db.set_charset_collation('utf8', 'utf8_general_ci')
and NOT
db.set_character_set('utf8')  # This feature is not available

I ran into the same problem when I was using PyMySQL. I checked the package version; it was 0.7.9.
Then I uninstalled it and installed PyMySQL 1.0.2, and the issue was solved.
pip uninstall PyMySQL
pip install PyMySQL

Python: You will need to add
# -*- coding: UTF-8 -*-
as the first line of the Python file, and then append the following to the text to encode: .encode('ascii', 'xmlcharrefreplace'). This replaces every non-ASCII character with its XML numeric character reference (for example &#8211; for \u2013), which is plain ASCII.
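What the xmlcharrefreplace handler produces can be checked directly. In this Python 3 sketch (the sample text is invented), the ASCII characters pass through unchanged and each non-ASCII character becomes a numeric character reference:

```python
text = "\u201cHi\u201d \u2013 caf\u00e9"

# Non-ASCII characters are replaced by &#NNNN; where NNNN is the
# decimal code point; everything else is kept as-is.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")
print(ascii_bytes)
```

The result is only meaningful to consumers that interpret such references, such as HTML or XML renderers; anything else will display the references literally.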

pysqlite2: ProgrammingError - You must not use 8-bit bytestrings

I'm currently persisting filenames in a sqlite database for my own purposes. Whenever I try to insert a file that has a special character (like é etc.), it throws the following error:
pysqlite2.dbapi2.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
When I do "switch my application over to Unicode strings" by wrapping the value sent to pysqlite with the unicode method like: unicode(filename), it throws this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 66: ordinal not in range(128)
Is there something I can do to get rid of this? Modifying all of my files to conform isn't an option.
UPDATE
If I decode the text via filename.decode("utf-8"), I'm still getting the ProgrammingError above.
My actual code looks like this:
cursor.execute("select * from musiclibrary where absolutepath = ?;",
[filename.decode("utf-8")])
What should my code here look like?

You need to specify the encoding of filename for conversion to Unicode, for example: filename.decode('utf-8'). Just using unicode(...) falls back on Python's default encoding, which is unreliable (and is ascii by default in Python 2).

You should pass as Unicode the arguments of your SQL statement.
Now, it all depends on how you obtain the filename list. Perhaps you're reading the filesystem using os.listdir or os.walk? If that is the case, there is a way to have directly the filenames as Unicode just by passing a Unicode argument to either of these functions:
Examples:
os.listdir(u'.')
os.walk(u'.')
Of course, you can substitute the u'.' directory with the actual directory whose contents you are reading. Just make sure it's a Unicode string.

Have you tried to pass the unicode string directly:
cursor.execute("select * from musiclibrary where absolutepath = ?;",(u'namé',))
You will need to add the file encoding at the beginning of the script:
# coding: utf-8

You figured this out already, but:
I don't think you could actually get that ProgrammingError exception from cursor.execute("select * from musiclibrary where absolutepath = ?;", [filename.decode("utf-8")]), as the question currently states.
Either the utf-8 decode would explode, or the cursor.execute call would be happy with the result.

Try to change to this:
cursor.execute("select * from musiclibrary where absolutepath = ?;",
[unicode(filename,'utf8')])
If your filename is not encoded as utf8 at its origin, replace 'utf8' with its actual encoding.
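In Python 3, where str is already Unicode, the same lookup can be sketched with the standard library's sqlite3 module. The in-memory database, the one-column table layout and the sample path are assumptions for illustration; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table musiclibrary (absolutepath text)")

# Bytes as they might come from a byte-oriented source; decode them once.
filename_bytes = "/music/caf\u00e9.mp3".encode("utf-8")
filename = filename_bytes.decode("utf-8")

# Parameters are passed as str, so no 8-bit bytestrings reach SQLite.
conn.execute("insert into musiclibrary (absolutepath) values (?)", (filename,))
rows = conn.execute(
    "select * from musiclibrary where absolutepath = ?;", (filename,)
).fetchall()
conn.close()
```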
