Trying to save special characters in MySQL DB - python

I have a string that looks like this 🔴Use O Mozilla Que Não Trava! Testei! $vip ou $apoio
When I try to save it to my database with ...SET description = %s... and cursor.execute(sql, description), I get an error:
Warning: (1366, "Incorrect string value: '\xF0\x9F\x94\xB4Us...' for column 'description' ...
Assuming this was an ASCII symbol, I tried description.decode('ascii'), but this leads to
'str' object has no attribute 'decode'
How can I determine what encoding it is, and how can I store something like that in the database? The database is UTF-8 encoded, if that matters.
I am using Python3 and PyMySQL.
Any hints appreciated!

First, you need to make sure the table column has the correct character set. If it is "latin1" you will not be able to store content that contains Unicode characters; for 4-byte characters such as the 🔴 emoji you need "utf8mb4", since MySQL's "utf8" only covers up to 3-byte sequences.
You can use the following query to determine the column's character set:
SELECT CHARACTER_SET_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA='your_database_name' AND TABLE_NAME='your_table_name' AND COLUMN_NAME='description'
Follow the MySQL documentation here if you want to change the column's character set.
Also, you need to make sure the character set is properly configured for the MySQL connection. Quoting the MySQL docs:
Character set issues affect not only data storage, but also
communication between client programs and the MySQL server. If you
want the client program to communicate with the server using a
character set different from the default, you'll need to indicate
which one. For example, to use the utf8 Unicode character set, issue
this statement after connecting to the server:
SET NAMES 'utf8';
Once the character set settings are correct, you will be able to execute your SQL statement as-is. There is no need to encode or decode on the Python side; that serves a different purpose.
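For example, with PyMySQL you can request utf8mb4 directly when connecting, so no SET NAMES is needed. A minimal sketch; the connection details and the table/column names (videos, description) are placeholders:
import pymysql

# Hypothetical connection parameters; adjust to your setup.
connection = pymysql.connect(
    host="localhost",
    user="your_user",
    password="your_password",
    database="your_database_name",
    charset="utf8mb4",  # 4-byte UTF-8, required for emoji such as 🔴
)

description = "🔴Use O Mozilla Que Não Trava! Testei! $vip ou $apoio"
with connection.cursor() as cursor:
    # Pass the value as a parameter; no manual encoding or decoding needed.
    cursor.execute("UPDATE videos SET description = %s WHERE id = %s", (description, 1))
connection.commit()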

Related

Sending UTF-8 formatted emojis from Android to Python API

I have been trying to send emojis through POST requests to my server (Python server-side) to store them in a database. I get the full string and convert it to UTF-8; the problem is that some emojis are sent fine, while others throw an error on the server side: Incorrect string value: '\\xF0\\x9F\\x8E\\xAE
I think this is because some emojis, like ❤️, are converted to %E2%9D%A4%EF%B8%8F on sending, while others, like 🎮, are converted to %F0%9F%8E%AE.
I have tested the requests through Postman, and the red-heart one works, but the others, with 4-byte sequences, don't, and I see that error.
Here is a Postman log capture
And here is the error from the Python Django API
OperationalError at /api/addcomment
(1366, "Incorrect string value: '\\xF0\\x9F\\x8E\\xAE' for column 'text' at row 1")
Django Version: 2.2.5
Exception Type: OperationalError
Exception Value:
(1366, "Incorrect string value: '\\xF0\\x9F\\x8E\\xAE' for column 'text' at row 1")
Exception Location: /var/www/vhosts/*/httpdocs/pythonvenv/lib/python3.5/site-packages/MySQLdb/connections.py in query, line 226
Python Executable: /var/www/vhosts/*/httpdocs/pythonvenv/bin/python
Python Version: 3.5.2
Python Path:
['/var/www/vhosts/*/httpdocs/pythonvenv/bin',
'/var/www/vhosts/*/httpdocs/app/app',
'/var/www/vhosts/*/httpdocs/app',
'/var/www/vhosts/*/httpdocs',
'/usr/share/passenger/helper-scripts',
'/var/www/vhosts/*/httpdocs/pythonvenv/lib/python35.zip',
'/var/www/vhosts/*/httpdocs/pythonvenv/lib/python3.5',
'/var/www/vhosts/*/httpdocs/pythonvenv/lib/python3.5/plat-x86_64-linux-gnu',
'/var/www/vhosts/*/httpdocs/pythonvenv/lib/python3.5/lib-dynload',
'/usr/lib/python3.5',
'/usr/lib/python3.5/plat-x86_64-linux-gnu',
'/var/www/vhosts/*/httpdocs/pythonvenv/lib/python3.5/site-packages']
I have replaced the original URL with *.
For more info: in phpMyAdmin I cannot insert those emojis either (the 4-byte ones like the gamepad) on the SQL or Insert tab, but I can insert the 6-byte ones like the red heart. I have tried several utf8 and utf8mb4 collations for both the column and the table.
This happens when inserting an emoji whether or not the database, table, and column are set to utf8mb4.
Any help? Thanks!
Both of these need to be set to utf8mb4:
The column charset
The database connection charset
The first one determines what strings can be stored in the column. The second determines the character set for string literals. (Oddly, if you put a 4-byte UTF-8 sequence in a string literal, MySQL can still think it's "3-byte utf8" and doesn't give an error until you try to use it)
To find if the database connection charset is the problem, you can try setting the character set on the string literal explicitly. If this works, the column encoding is fine, but the connection isn't:
insert into demo_table set `text` = _utf8mb4'🎮';
You seem to be using Django. I don't know much about Django but it looks like the connection encoding is set somewhere in the database connection options. Going by https://chriskief.com/2017/06/18/django-and-mysql-emoticons/ :
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        ...
        'OPTIONS': {'charset': 'utf8mb4'},
    }
}
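If the column itself is still 3-byte utf8, it also needs converting. A sketch using Django's connection, assuming the demo_table name from the example above and the common utf8mb4_unicode_ci collation; run it once or put the equivalent in a migration:
from django.db import connection

# Hypothetical table name; converts every text column of the table to utf8mb4.
with connection.cursor() as cursor:
    cursor.execute(
        "ALTER TABLE demo_table "
        "CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci"
    )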

Confused about encoding issue when read from mysql via python code

There is one row in a MySQL table, as follows:
1000, Intel® Rapid Storage Technology
The table's charset was 'utf8' when it was created.
When I used Python code to read it, it became the following:
Intel® Management Engine Firmware
My Python code is as follows:
db = MySQLdb.connect(db,user,passwd,dbName,port,charset='utf8')
The weird thing was that when I removed the charset='utf8', as follows:
db = MySQLdb.connect(db,user,passwd,dbName,port), the result became correct.
Why do I get the wrong result when I specify charset='utf8' in my code?
Have you tried leaving the charset out of the connect call and then setting it afterwards?
db = MySQLdb.connect(db,user,passwd,dbName,port)
db.set_character_set('utf8')
When trying to use utf8/utf8mb4, if you see Mojibake, check the following.
This discussion also applies to Double Encoding, which is not necessarily visible.
The bytes to be stored need to be utf8-encoded.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4).
HTML should start with <meta charset=UTF-8>.
See also Python notes
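One quick way to distinguish Mojibake or double encoding from correctly stored data is to compare the stored bytes against the expected UTF-8 bytes. A small diagnostic sketch with MySQLdb; the table and column names (mytable, title) are placeholders:
import MySQLdb

# Hypothetical connection parameters; adjust to your setup.
db = MySQLdb.connect(host="localhost", user="user", passwd="passwd",
                     db="dbName", charset="utf8mb4")
cur = db.cursor()
# HEX() shows the bytes MySQL actually stored. A correctly stored "®" (U+00AE)
# appears as C2AE; a double-encoded one typically shows up as C382C2AE.
cur.execute("SELECT title, HEX(title) FROM mytable LIMIT 5")
for text, hex_bytes in cur.fetchall():
    print(text, hex_bytes)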

How to read national characters (>127) from US7ASCII Oracle using Python cx_Oracle?

I have a problem displaying national characters from an “ENGLISH_UNITED KINGDOM.US7ASCII” Oracle 11 database using Python 3.3, cx_Oracle 5.1.2, and the "NLS_LANG" environment variable.
The DB table column type is "VARCHAR2(2000 BYTE)".
How can I display the string "£aÀÁÂÃÄÅÆÇÈ" from Oracle US7ASCII in Python? This will be some sort of hack.
The hack works in every other scripting language (Perl, PHP, PL/SQL) and in Python 2.7, but it does not work in Python 3.3.
In the Oracle 11 database I created SECURITY_HINTS.ANSWER="£aÀÁÂÃÄÅÆÇÈ". The ANSWER column type is "VARCHAR2(2000 BYTE)".
Now, when using cx_Oracle and the default NLS_LANG, I get "¿a¿¿¿¿¿¿¿¿¿",
and when using NLS_LANG="ENGLISH_UNITED KINGDOM.US7ASCII" I get
"UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 0: ordinal not in range(128)"
Update1
I made some progress. When switching to Python 2.7 and cx_Oracle 5.1.2 for Python 2.7, the problem goes away (I get all the >127 characters from the DB). In Python 2 strings are represented as bytes, while in Python 3+ strings are represented as Unicode. I still need the best possible solution for Python 3.3.
Update2
One possible solution to the problem is to use rawtohex(utl_raw.cast_to_raw(...)); see the code below.
cursor.execute("select rawtohex(utl_raw.cast_to_raw(ANSWER)) from security_hints where userid = '...'")
for rawValue in cursor:
    print(''.join(['%c' % iterating_var for iterating_var in binascii.unhexlify(rawValue[0])]))
The source code of my script is below, or at GitHub and GitHub Solution.
def test_nls(nls_lang=None):
    print(">>> run test_nls for %s" % (nls_lang))
    if nls_lang:
        os.environ["NLS_LANG"] = nls_lang
        os.environ["ORA_NCHAR_LITERAL_REPLACE"] = "TRUE"
    connection = get_connection()
    cursor = connection.cursor()
    print("version=%s\nencoding=%s\tnencoding=%s\tmaxBytesPerCharacter=%s" % (connection.version, connection.encoding,
                                                                              connection.nencoding, connection.maxBytesPerCharacter))
    cursor.execute("SELECT USERENV ('language') FROM DUAL")
    for result in cursor:
        print("%s" % (result))
    cursor.execute("select ANSWER from SECURITY_HINTS where USERID = '...'")
    for rawValue in cursor:
        print("query returned [%s]" % (rawValue))
        answer = rawValue[0]
        str = ""
        for iterating_var in answer:
            str = ("%s [%d]" % (str, ord(iterating_var)))
        print("str %s" % (str))
    cursor.close()
    connection.close()

if __name__ == '__main__':
    test_nls()
    test_nls(".AL32UTF8")
    test_nls("ENGLISH_UNITED KINGDOM.US7ASCII")
see log output below.
run test_nls for None
version=11.1.0.7.0
encoding=WINDOWS-1252 nencoding=WINDOWS-1252 maxBytesPerCharacter=1
ENGLISH_UNITED KINGDOM.US7ASCII
query returned [¿a¿¿¿¿¿¿¿¿¿]
str [191] [97] [191] [191] [191] [191] [191] [191] [191] [191] [191
run test_nls for .AL32UTF8
version=11.1.0.7.0
encoding=UTF-8 nencoding=UTF-8 maxBytesPerCharacter=4
AMERICAN_AMERICA.US7ASCII
query returned [�a���������]
str [65533] [97] [65533] [65533] [65533] [65533] [65533] [65533] [65533] [65533] [65533]
run test_nls for ENGLISH_UNITED KINGDOM.US7ASCII
version=11.1.0.7.0
encoding=US-ASCII nencoding=US-ASCII maxBytesPerCharacter=1
ENGLISH_UNITED KINGDOM.US7ASCII
Traceback (most recent call last):
File "C:/dev/tmp/Python_US7ASCII_cx_Oracle/showUS7ASCII.py", line 71, in <module>
test_nls("ENGLISH_UNITED KINGDOM.US7ASCII")
File "C:/dev/tmp/Python_US7ASCII_cx_Oracle/showUS7ASCII.py", line 55, in test_nls
for rawValue in cursor:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 0: ordinal not in range(128)
I am trying to display it in a Django web page, but each character comes back as character code 191 or 65533.
I looked at
choosing NLS_LANG for Oracle and
Importing from Oracle using the correct encoding with Python
Cannot Insert Unicode Using cx-Oracle
If you want to get the string unchanged in the client application, the best way is to transfer it from the DB in binary mode. So the first conversion must be done on the server side with the help of the UTL_RAW package and the standard rawtohex function.
Your select in cursor.execute may look like this:
select rawtohex(utl_raw.cast_to_raw(ANSWER)) from SECURITY_HINTS where USERID = '...'
On the client you get a string of hexadecimal characters, which can be converted back to a string representation with the help of the binascii.unhexlify function:
for rawValue in cursor:
    print("query returned [%s]" % (binascii.unhexlify(rawValue[0])))
P.S. I don't know Python, so the last statement may be incorrect.
I think you should not resort to such evil trickery. NLS_LANG should simply be set to the client's default encoding. Look at more solid options:
Extend the character set of the database to allow these characters in a VARCHAR column.
Upgrade this particular column to NVARCHAR. You could perhaps use a new name for this column and create a VARCHAR computed column with the old name for the legacy applications to read.
Keep the database as is but check the data when it gets entered and replace all non-ASCII characters with an acceptable ASCII equivalent.
Which option is best depends on how common the non-ASCII characters are. If there's more tables with the same issue, I'd suggest option 1. If this is the only table, option 2. If there are only a couple non-ASCII characters in the entire table, and their loss is not that big a deal: option 3.
One of the tasks of a database is to preserve the quality of your data after all, and if you cheat when forcibly inserting illegal characters into the column, it cannot do its job properly and each new client or upgrade or export will come with interesting new undefined behavior.
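For option 3 above, here is a small Python sketch of one way to coerce input to ASCII before it reaches the database; the replacement policy (strip accents, drop everything else) is an assumption and should be adapted to your data:
import unicodedata

def to_ascii(text):
    # Decompose accented characters (e.g. 'À' -> 'A' + combining accent),
    # then drop anything that still isn't ASCII (e.g. '£' and 'Æ').
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii("£aÀÁÂÃÄÅÆÇÈ"))  # prints "aAAAAAACE"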
EDIT: See Oracle's comment on an example of a similar setup in the NLS_LANG faq (my emphasis):
A database is created on a UNIX system with the US7ASCII character set. A Windows client connecting to the database works with the WE8MSWIN1252 character set (regional settings -> Western Europe / ACP 1252), and the DBAs use the UNIX shell (ROMAN8) to work on the database. The NLS_LANG is set to american_america.US7ASCII on the clients and the server.
Note:
This is an INCORRECT setup to explain character set conversion, don't
use it in your environment!

Unable to convert PostgreSQL text column to bytea

In my application I am using a PostgreSQL database table with a "text" column to store pickled Python objects.
As the database driver I'm using psycopg2, and until now I only passed Python strings (not unicode objects) to the DB and retrieved strings from the DB. This basically worked fine until I recently decided to handle strings the better/correct way and added the following construct to my DB layer:
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
This basically works fine everywhere in my application, and I'm using unicode objects where possible now.
But for this special case, with the text column containing the pickled objects, it causes trouble. I got it working in my test system this way:
retrieving the data:
SELECT data::bytea, params FROM mytable
writing the data:
execute("UPDATE mytable SET data=%s", (psycopg2.Binary(cPickle.dumps(x)),) )
... but unfortunately I'm getting errors with the SELECT for some columns in the production system:
psycopg2.DataError: invalid input syntax for type bytea
This error also happens when I try to run the query in the psql shell.
Basically I'm planning to convert the column from "text" to "bytea", but the error above also prevents me from doing this conversion.
As far as I can see (when retrieving the column as a pure Python string), there are only characters with ord(c) <= 127 in the string.
The problem is that casting text to bytea doesn't mean "take the bytes in the string and assemble them as a bytea value", but rather "take the string and interpret it as an escaped input value for the bytea type". So that won't work, mainly because pickle data contains lots of backslashes, which bytea interprets specially.
Try this instead:
SELECT convert_to(data, 'LATIN1') ...
This converts the string into a byte sequence (bytea value) in the LATIN1 encoding. For you, the exact encoding doesn't matter, because it's all ASCII anyway (there is just no "ASCII" encoding to pass here).
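A minimal sketch of reading the pickled objects back this way with psycopg2, using the table and column names from the question (mytable, data, params); Python 3's pickle is used here, which is an assumption about how the data was written:
import pickle
import psycopg2

# Hypothetical connection string; adjust to your environment.
conn = psycopg2.connect("dbname=mydb user=me")
cur = conn.cursor()
# convert_to() yields a bytea value without bytea's escape-syntax parsing;
# LATIN1 maps each character back to a single byte.
cur.execute("SELECT convert_to(data, 'LATIN1'), params FROM mytable")
for raw_data, params in cur.fetchall():
    obj = pickle.loads(bytes(raw_data))  # raw_data arrives as a memoryview/buffer
    print(obj, params)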

using pyodbc on linux to insert unicode or utf-8 chars in a nvarchar mssql field

I am using Ubuntu 9.04
I have installed the following package versions:
unixodbc and unixodbc-dev: 2.2.11-16build3
tdsodbc: 0.82-4
libsybdb5: 0.82-4
freetds-common and freetds-dev: 0.82-4
I have configured /etc/unixodbc.ini like this:
[FreeTDS]
Description = TDS driver (Sybase/MS SQL)
Driver = /usr/lib/odbc/libtdsodbc.so
Setup = /usr/lib/odbc/libtdsS.so
CPTimeout =
CPReuse =
UsageCount = 2
I have configured /etc/freetds/freetds.conf like this:
[global]
tds version = 8.0
client charset = UTF-8
I have grabbed pyodbc revision 31e2fae4adbf1b2af1726e5668a3414cf46b454f from http://github.com/mkleehammer/pyodbc and installed it using "python setup.py install"
I have a Windows machine with Microsoft SQL Server 2000 installed on my local network, up and listening on the local IP address 10.32.42.69. I have an empty database created with the name "Common". I have the user "sa" with password "secret" with full privileges.
I am using the following python code to setup the connection:
import pyodbc
odbcstring = "SERVER=10.32.42.69;UID=sa;PWD=secret;DATABASE=Common;DRIVER=FreeTDS"
con = pyodbc.connect(odbcstring)
cur = con.cursor()
cur.execute('''
CREATE TABLE testing (
    id INTEGER NOT NULL IDENTITY(1,1),
    name NVARCHAR(200) NULL,
    PRIMARY KEY (id)
)
''')
con.commit()
Everything WORKS up to this point. I have used SQLServer's Enterprise Manager on the server and the new table is there.
Now I want to insert some data on the table.
cur = con.cursor()
cur.execute('INSERT INTO testing (name) VALUES (?)', (u'something',))
That fails!! Here's the error I get:
pyodbc.Error: ('HY004', '[HY004] [FreeTDS][SQL Server]Invalid data type
(0) (SQLBindParameter)'
Since my client is configured to use UTF-8, I thought I could solve this by encoding the data to UTF-8. That works, but then I get back strange data:
cur = con.cursor()
cur.execute('DELETE FROM testing')
cur.execute('INSERT INTO testing (name) VALUES (?)', (u'somé string'.encode('utf-8'),))
con.commit()
# fetching data back
cur = con.cursor()
cur.execute('SELECT name FROM testing')
data = cur.fetchone()
print type(data[0]), data[0]
That gives no error, but the data returned is not the same data sent! I get:
<type 'unicode'> somé string
That is, pyodbc won't accept a unicode object directly, but it returns unicode objects back to me! And the encoding is being mixed up!
Now for the question:
I want code to insert unicode data in a NVARCHAR and/or NTEXT field. When I query back, I want the same data I inserted back.
That can be done by configuring the system differently, or by using a wrapper function able to convert the data correctly to/from unicode when inserting or retrieving.
That's not asking much, is it?
I remember having this kind of stupid problem with ODBC drivers, though that time it was a Java + Oracle combination.
The core thing is that the ODBC driver apparently encodes the query string when sending it to the DB. Even if the field is Unicode, and even if you provide Unicode, in some cases it does not seem to matter.
You need to ensure that what is sent by the driver has the same encoding as your database (not only the server, but also the database). Otherwise, of course, you get funky characters, because either the client or the server is mixing things up when encoding or decoding. Do you have any idea of the charset (code page, as MS likes to say) that your server is using as a default for decoding data?
Collation has nothing to do with this problem :)
See that MS page for example. For Unicode fields, collation is used only to define the sort order in the column, not to specify how the data is stored.
If you store your data as Unicode, there is a unique way to represent it; that's the purpose of Unicode: no need to define a charset that is compatible with all the languages that you are going to use :)
The question here is "what happens when I give data to the server that is not Unicode?". For example:
When I send an UTF-8 string to the server, how does it understand it?
When I send an UTF-16 string to the server, how does it understand it?
When I send a Latin1 string to the server, how does it understand it?
From the server's perspective, all three of these strings are only a stream of bytes. The server cannot guess the encoding in which you encoded them. This means you will get into trouble if your ODBC client ends up sending bytestrings (an encoded string) to the server instead of sending Unicode data: if you do so, the server will use a predefined encoding (that was my question: what encoding will the server use? Since it is not guessing, it must be a parameter value), and if the string had been encoded using a different encoding, dzing, the data will get corrupted.
It's exactly similar as doing in Python:
uni = u'Hey my name is André'
in_utf8 = uni.encode('utf-8')
# send the utf-8 data to server
# send(in_utf8)
# on server side
# server receives it. But server is Japanese.
# So the server treats the data with the National charset, shift-jis:
some_string = in_utf8 # some_string = receive()
decoded = some_string.decode('sjis')
Just try it. It's fun. The decoded string is supposed to be "Hey my name is André", but is "Hey my name is Andrテゥ". The é gets replaced by the Japanese テゥ.
Hence my suggestion: you need to ensure that pyodbc is able to send the data directly as Unicode. If pyodbc fails to do this, you will get unexpected results.
And I described the problem in the client-to-server direction, but the same sort of issues can arise when communicating back from the server to the client. If the client cannot understand Unicode data, you'll likely get into trouble.
FreeTDS handles Unicode for you.
Actually, FreeTDS takes care of things for you and translates all the data to UCS2 unicode. (Source).
Server <--> FreeTDS : UCS2 data
FreeTDS <--> pyodbc : encoded strings, encoded in UTF-8 (from /etc/freetds/freetds.conf)
So I would expect your application to work correctly if you pass UTF-8 data to pyodbc. In fact, as this django-pyodbc ticket states, django-pyodbc communicates in UTF-8 with pyodbc, so you should be fine.
FreeTDS 0.82
However, cramm0 says that FreeTDS 0.82 is not completely bug-free, and that there are significant differences between 0.82 and the official patched 0.82 version that can be found here. You should probably try using the patched FreeTDS.
Edited: removed old data, which had nothing to do with FreeTDS but was only relevant to Easysoft commercial odbc driver. Sorry.
I use UCS-2 to interact with SQL Server, not UTF-8.
Correction: I changed the .freetds.conf entry so that the client uses UTF-8
tds version = 8.0
client charset = UTF-8
text size = 32768
Now, bind values work fine for UTF-8 encoded strings.
The driver converts transparently between the UCS-2 used for storage on the dataserver side and the UTF-8 encoded strings given to/taken from the client.
This is with pyodbc 2.0 on Solaris 10 running Python 2.5 and FreeTDS freetds-0.82.1.dev.20081111 and SQL Server 2008
import pyodbc
test_string = u"""Comment ça va ? Très bien ?"""
print type(test_string), repr(test_string)
utf8 = 'utf8:' + test_string.encode('UTF-8')
print type(utf8), repr(utf8)
c = pyodbc.connect('DSN=SA_SQL_SERVER_TEST;UID=XXX;PWD=XXX')
cur = c.cursor()
# This does not work as test_string is not UTF-encoded
try:
    cur.execute('INSERT unicode_test(t) VALUES(?)', test_string)
    c.commit()
except pyodbc.Error, e:
    print e
# This one does:
try:
    cur.execute('INSERT unicode_test(t) VALUES(?)', utf8)
    c.commit()
except pyodbc.Error, e:
    print e
Here is the output from the test table (I had manually put in a bunch of test data via Management Studio)
In [41]: for i in cur.execute('SELECT t FROM unicode_test'):
....: print i
....:
....:
('this is not a banana', )
('\xc3\x85kergatan 24', )
('\xc3\x85kergatan 24', )
('\xe6\xb0\xb4 this is code-point 63CF', )
('Mich\xc3\xa9l', )
('Comment a va ? Trs bien ?', )
('utf8:Comment \xc3\xa7a va ? Tr\xc3\xa8s bien ?', )
I was able to put some Unicode code points directly into the table from Management Studio via the 'Edit Top 200 rows' dialog, by entering the hex digits for the Unicode code point and then pressing Alt-X.
I had the same problem when trying to bind a unicode parameter:
'[HY004] [FreeTDS][SQL Server]Invalid data type (0) (SQLBindParameter)'
I solved it by upgrading freetds to version 0.91.
I use pyodbc 2.1.11. I had to apply this patch to make it work with unicode, otherwise I was getting memory corruption errors occasionally.
Are you sure it's the INSERT that's causing the problem and not the read?
There's a bug open on pyodbc Problem fetching NTEXT and NVARCHAR data.
