Confused about encoding issue when read from mysql via python code

Confused about encoding issue when read from mysql via python code - python

There is one row in Mysql table as following:
1000, Intel® Rapid Storage Technology
The table's charset='utf8' when was created.
When I used python code to read it, it become the following:
IntelÂ® Management Engine Firmware
My python code as following:
db = MySQLdb.connect(db,user,passwd,dbName,port,charset='utf8')
The weird thing was that when I removed the charset='utf8' as following:
db = MySQLdb.connect(db,user,passwd,dbName,port), the result become correct.
Why when I indicated charset='utf8' in my code, but got wrong result please?

Have you tried leaving off the charset in the connect string and then setting afterwards?
db = MySQLdb.connect(db,user,passwd,dbName,port)
db.set_character_set('utf8')

When trying to use utf8/utf8mb4, if you see Mojibake, check the following.
This discussion also applies to Double Encoding, which is not necessarily visible.
The bytes to be stored need to be utf8-encoded.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4).
HTML should start with <meta charset=UTF-8>.
See also Python notes

Related

Python MySQLdb file load truncating rows, works fine when loading file from another mysql client

I'm getting data loss when doing a csv import using the Python MySQLdb module. The crazy thing is that I can load the exact same csv using other MySQL clients and it works fine.
It works perfectly fine when running the exact same command with the exact same csv from sequel pro mysql client
It works perfectly fine when running the exact same command with the exact same csv from the mysql command line
It doesn't work (some rows truncated) when loading through python script using mysqldb module.
It's truncating about 10 rows off of my 7019 row csv.
The command I'm calling:
LOAD DATA LOCAL INFILE '/path/to/load.txt' REPLACE INTO TABLE tble_name FIELDS TERMINATED BY ","
When the above command is ran using the native mysql client on linux or sequel pro mysql client on mac it works fine and I get 7019 rows imported.
When the above command is ran using Python's MySQLdb module such as:
dest_cursor.execute( '''LOAD DATA LOCAL INFILE '/path/to/load.txt' REPLACE INTO TABLE tble_name FIELDS TERMINATED BY ","''' )
dest_db.commit()
Most all rows are imported but I get thrown out a slew of
Warning: (1265L, "Data truncated for column '<various_column_names' at row <various_rows>")
When the warnings pop up, it states at row <row_num> but I'm not seeing that correlate to the row in the csv (I think it's the row it's trying to create on the target table, not the row in the csv) so I can't use that to help troubleshoot.
And sure enough, when it's done, my target table is missing some rows.
Unfortunately with over 7,000 rows in the csv it's hard to tell exactly which line it's choking on for further analysis. When the warnings pop up, it states at row <row_num> but I'm not seeing that correlate to the row in the csv (I think it's the row it's trying to create on the target table, not the row in the csv) so I can't use that to help troubleshoot.
There are many rows that are null and/or empty spaces but they are importing fine.
The fact that I can import the entire csv using other MySQL clients makes me feel that the MySQLdb module is not configured right or something.
This is Python 2.7
Any help is appreciated. Any ideas on how to get better visibility into which line it's choking up on would be helpful.

To Further help I would ask you the following.
Error Checking
After your import using any of your three ways, are there any results from running this after each run? SELECT ##GLOBAL.SQL_WARNINGS; (if so this should show you the errors, as it might be silently failing.)
What is your SQL_MODE? SELECT ##GLOBAL.SQL_MODE;
Check the file and make sure you have an even number of "'s for one.
Check the data for extra " or ,'s or anything that may get caught in translation of bash/python/mysql?
Data Request
Can you provide the data for the 1st row that was missing?
Can you provide the exact script you are using?
Versions
You said your using python 2.7
What version of mysql client? SELECT ##GLOBAL.VERSION;
What version of MySQLdb?
Internationalization
Are you dealing with internationalization (汉语 Hànyǔ or русский etc. languages)?
What is the database/schema collation?
Query:
SELECT DISTINCT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM INFORMATION_SCHEMA.SCHEMATA
WHERE (
SCHEMA_NAME <> 'sys' AND
SCHEMA_NAME <> 'mysql' AND
SCHEMA_NAME <> 'information_schema' AND
SCHEMA_NAME <> '.mysqlworkbench' AND
SCHEMA_NAME <> 'performance_schema'
);
What is the Table collation?
Query:
SELECT DISTINCT ENGINE, TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES
WHERE (
TABLE_SCHEMA <> 'sys' AND
TABLE_SCHEMA <> 'mysql' AND
TABLE_SCHEMA <> 'information_schema' AND
TABLE_SCHEMA <> '.mysqlworkbench' AND
TABLE_SCHEMA <> 'performance_schema'
);
What is the column collation?
Query:
SELECT DISTINCT CHARACTER_SET_NAME, COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS
WHERE (
TABLE_SCHEMA <> 'sys' AND
TABLE_SCHEMA <> 'mysql' AND
TABLE_SCHEMA <> 'information_schema' AND
TABLE_SCHEMA <> '.mysqlworkbench' AND
TABLE_SCHEMA <> 'performance_schema'
);
Lastly
Check the Database
For connection collation/character_set
SHOW VARIABLES
WHERE VARIABLE_NAME LIKE 'CHARACTER\_SET\_%' OR
VARIABLE_NAME LIKE 'COLLATION%';
If the first two ways work without error then I'm leaning toward:
Other Plausible Concerns
I am not ruling out problems with any of the following:
possible python connection configuriation issues around
python to db connection collation
default connection timeout
default character set error
python/bash runtime interpolation of symbols causing a random hidden gem
db collation not set to handle foreign languages
exceeding the MAX(field values)
hidden or unicode characters
emoji processing
issues with the data as i mentioned above with Double-Quotes, Commas, and I forgot to mention about NewLines for Windows or Linux (Carriage return or NewLine)
All in all there is a lot to look at and require more information to further assist.
Please update your question when you have more information and I will do the same for my answer to help you resolve your error.
Hope this helps and all goes well!
Update:
Your Error
Warning: (1265L, "Data truncated for column
Leads me to believe it is the Double-Quote around your "field terminations" Check to make sure your data does NOT have commas inside of the errored out fields. This will cause your data to shift when running command-line. As the gui is "Smart-ENOUGH" per say to deal with this. but the command-line is literal!

This is an embarrassing one but maybe I can help someone in the future making horrible mistakes like I have.
I spent a lot of time analyzing fields, checking for special characters, etc and it turned out I was simply causing the problem myself.
I had spaces in the csv, and NOT using a forced ENCLOSED BY in the load statement. This means I was adding a space character to some fields thus causing an overflow. So the data looked like value1, value2, value3 when it should have been value1,value2,value3. Removing those spaces, putting quotes around the fields and enforcing ENCLOSED BY in my statement fixed this.
I assume that the clients that were working were sanitizing the data behind the scenes or something. I really don't know for sure why it was working elsewhere using the same csv but that got me through the first set of hurdles.
Then after getting through that, the last line in the csv was choking and it was stating Row doesn't contain data for all columns - turns out I didn't close() the file after creating it before attempting to load it. So there was some sort of lock on the file. Once I added the close() statement and fixed the spacing issue, all the data is loading now.
Sorry for anyone that spent any measure of time looking into this issue for me.

Incorrect string value error - Python + mariaDB

I am using the Python mysql-connector module to insert unicode character point 128049 (U+1F431) into a mariaDB sql table.
My SQL table is defined as:
show create table t1;
CREATE TABLE `t1` (
`c1` varchar(20) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
And the python code is:
import mysql.connector as db
conn = db.connect(sql_mode = 'STRICT_ALL_TABLES')
curs = conn.cursor(prepared = True)
curs.execute('insert into t1 (c1) values(%)', chr(128049))
Since this is a plane 1 unicode value it needs 4 bytes, but changing the table and column to utf8mb4 as suggested here didn't work.
The error I'm getting is:
Incorrect string value: '\xF0\x9F\x90\xB1' for column 'c1' at row 1
The string being inserted looks correct when compared to:
chr(128049).encode('utf-8')
The sql_mode for this version of mariadb is not strict by default. While the insert works when I do not specify strict mode, the characters are converted to the default '?' character.
I can't figure out how why SQL thinks this is an invalid string.
I am connecting to mariadb 10.1.9 via mysql-connector 2.1.4 in python 3.6.1.

The connection needs to specify utf8mb4. Or SET NAMES utf8mb4. This is to specify the encoding of the client's bytes.
🐱 is a 4-byte Emoji.
More Python tips: http://mysql.rjweb.org/doc.php/charcoll#python

Rick James answer is correct. From that I was able to create a solution that worked for me.
SET NAMES 'utf8mb4';
Sets 3 global variables as seen here. The only issue is this only sets session variables so you have to issue this command for every connection.
It doesn't appear possible to set those 3 variables in the mysqld group of the my.cnf file (I believe this is because they can not be set at the command line. Note the missing command line detail in the definitions here)
Instead I set the init_file option in the mysqld group of the my.cnf options file.
[mysqld]
init_file=/path/to/file.sql
Within that file I set the 3 variables:
set ##global.character_set_client='utf8mb4';
set ##global.character_set_connection='utf8mb4';
set ##global.character_set_results='utf8mb4';
Setting these globally forced the session variables to the same value. Problem solved.

Trying to save special characters in MySQL DB

I have a string that looks like this 🔴Use O Mozilla Que Não Trava! Testei! $vip ou $apoio
When I try to save it to my database with ...SET description = %s... and cursor.execute(sql, description) it gives me an error
Warning: (1366, "Incorrect string value: '\xF0\x9F\x94\xB4Us...' for column 'description' ...
Assuming this is an ASCII symbol, I tried description.decode('ascii') but this leads to
'str' object has no attribute 'decode'
How can I determine what encoding it is and how could I store anything like that to the database? The database is utf-8 encoded if that is important.
I am using Python3 and PyMySQL.
Any hints appreciated!

First, you need to make sure the table column has correct character set setting. If it is "latin1" you will not be able to store content that contains Unicode characters.
You can use following query to determine the column character set:
SELECT CHARACTER_SET_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA='your_database_name' AND TABLE_NAME='your_table_name' AND COLUMN_NAME='description'
Following Mysql document here if you want to change column character set.
Also, you need to make sure character set is properly configured for Mysql connection. Quoted from Mysql doc:
Character set issues affect not only data storage, but also
communication between client programs and the MySQL server. If you
want the client program to communicate with the server using a
character set different from the default, you'll need to indicate
which one. For example, to use the utf8 Unicode character set, issue
this statement after connecting to the server:
SET NAMES 'utf8';
Once character set setting is correct, you will be able to execute your sql statement. There is no need to encode / decode in Python side. That is used for different purposes.

pymssql returns different charset on Azure/Windows than on Mac

I have an sql server database hosted on Azure. I have put a string in the database with smart quotes('“test”'). I can connect to it and run a simple query:
import pymssql
import json
conn = pymssql.connect(
server='coconut.database.windows.net',
user='kingfish#coconut',
password='********',
database='coconut',
charset='UTF-8',
)
sql = """
SELECT * FROM messages WHERE id = '548a72cc-f584-7e21-2725-fe4dd594982f'
"""
cursor = conn.cursor()
cursor.execute(sql)
row = cursor.fetchone()
json.dumps(row[3])
When I run this query on my Mac (macOS 10.11.6, Python 3.4.4, pymssql 2.1.3) I get back the string:
"\u201ctest\u201d"
This is correctly interpreted as smart quotes and displays properly.
When I run this query on an Azure web deployment (Python 3.4, Azure App service) I get back a different (and incorrect) encoding for that same string:
"\u0093test\u0094"
I specified the charset as 'UTF-8' on the pymssql connection. Why does the Windows/Azure environment get back a different charset?
(note: I have put the pre-built binary pymssql-2.1.3-cp34-none-win32.whl in the wheelhouse of my project repo on Azure. This is the same as the pymssql pre-built binary pymssql-2.1.3-cp34-cp34m-win32.whl on PyPI only I had to rename the 'cp34m' to 'none' to convince pip to install it.)

According to your description, I think it seems that the issue was caused by the default charset encoding of the SQL Database on Azure. For verifing my thought, I did some testing below in Python 3.
The default charset encoding of SQL Database on Azure is Windows-1252 (CP-1252).
SQL Server Collation Support
The default database collation used by Microsoft Azure SQL Database is SQL_LATIN1_GENERAL_CP1_CI_AS, where LATIN1_GENERAL is English (United States), CP1 is code page 1252, CI is case-insensitive, and AS is accent-sensitive. It is not possible to alter the collation for V12 databases. For more information about how to set the collation, see COLLATE (Transact-SQL).
>>> u"\u201c".encode('cp1252')
b'\x93'
>>> u"\u201d".encode('cp1252')
b'\x94'
As the code above shown, the \u0093 & \u0094 can be got via encode \u201c & \u201d.
And,
>>> u"\u0093".encode('utf-8')
b'\xc2\x93'
>>> u"\u0093".encode('utf-8').decode('cp1252')[1]
'“' # It's `\u201c`
>>> u"\u201c" == u"\u0093".encode('utf-8').decode('cp1252')[1]
True
So I think the charset encoding of your current SQL Database for data storage is Latin-1, not UTF-8, when you created the SQL Database, as the figure below, the default property Collation on Azure portal is SQL_Latin1_General_CP1_CI_AS. Please try to use the other collation support UTF-8 instead of the default one.

I ended up recasting the column type from VARCHAR to NVARCHAR. This solved my problem, characters are correctly interpreted, regardless of platform.

using pyodbc on linux to insert unicode or utf-8 chars in a nvarchar mssql field

I am using Ubuntu 9.04
I have installed the following package versions:
unixodbc and unixodbc-dev: 2.2.11-16build3
tdsodbc: 0.82-4
libsybdb5: 0.82-4
freetds-common and freetds-dev: 0.82-4
I have configured /etc/unixodbc.ini like this:
[FreeTDS]
Description = TDS driver (Sybase/MS SQL)
Driver = /usr/lib/odbc/libtdsodbc.so
Setup = /usr/lib/odbc/libtdsS.so
CPTimeout =
CPReuse =
UsageCount = 2
I have configured /etc/freetds/freetds.conf like this:
[global]
tds version = 8.0
client charset = UTF-8
I have grabbed pyodbc revision 31e2fae4adbf1b2af1726e5668a3414cf46b454f from http://github.com/mkleehammer/pyodbc and installed it using "python setup.py install"
I have a windows machine with Microsoft SQL Server 2000 installed on my local network, up and listening on the local ip address 10.32.42.69. I have an empty database created with name "Common". I have the user "sa" with password "secret" with full priviledges.
I am using the following python code to setup the connection:
import pyodbc
odbcstring = "SERVER=10.32.42.69;UID=sa;PWD=secret;DATABASE=Common;DRIVER=FreeTDS"
con = pyodbc.connect(s)
cur = con.cursor()
cur.execute('''
CREATE TABLE testing (
id INTEGER NOT NULL IDENTITY(1,1),
name NVARCHAR(200) NULL,
PRIMARY KEY (id)
)
''')
con.commit()
Everything WORKS up to this point. I have used SQLServer's Enterprise Manager on the server and the new table is there.
Now I want to insert some data on the table.
cur = con.cursor()
cur.execute('INSERT INTO testing (name) VALUES (?)', (u'something',))
That fails!! Here's the error I get:
pyodbc.Error: ('HY004', '[HY004] [FreeTDS][SQL Server]Invalid data type
(0) (SQLBindParameter)'
Since my client is configured to use UTF-8 I thought I could solve by encoding data to UTF-8. That works, but then I get back strange data:
cur = con.cursor()
cur.execute('DELETE FROM testing')
cur.execute('INSERT INTO testing (name) VALUES (?)', (u'somé string'.encode('utf-8'),))
con.commit()
# fetching data back
cur = con.cursor()
cur.execute('SELECT name FROM testing')
data = cur.fetchone()
print type(data[0]), data[0]
That gives no error, but the data returned is not the same data sent! I get:
<type 'unicode'> somÃ© string
That is, pyodbc won't accept an unicode object directly, but it returns unicode objects back to me! And the encoding is being mixed up!
Now for the question:
I want code to insert unicode data in a NVARCHAR and/or NTEXT field. When I query back, I want the same data I inserted back.
That can be by configuring the system differently, or by using a wrapper function able to convert the data correctly to/from unicode when inserting or retrieving
That's not asking much, is it?

I can remember having this kind of stupid problems using odbc drivers, even if that time it was a java+oracle combination.
The core thing is that odbc driver apparently encodes the query string when sending it to the DB. Even if the field is Unicode, and if you provide Unicode, in some cases it does not seem to matter.
You need to ensure that what is sent by the driver has the same encoding as your Database (not only server, but also database). Otherwise, of course you get funky characters because either the client or the server is mixing things up when encoding/or decoding. Do you have any idea of the charset (codepoint as MS like to say) that your server is using as a default for decoding data?
Collation has nothing to do with this problem :)
See that MS page for example. For Unicode fields, collation is used only to define the sort order in the column, not to specify how the data is stored.
If you store your data as Unicode, there is an Unique way to represent it, that's the purpose of Unicode: no need to define a charset that is compatible with all the languages that you are going to use :)
The question here is "what happens when I give data to the server that is not Unicode?". For example:
When I send an UTF-8 string to the server, how does it understand it?
When I send an UTF-16 string to the server, how does it understand it?
When I send a Latin1 string to the server, how does it understand it?
From the server perspective, all these 3 strings are only a stream of bytes. The server cannot guess the encoding in which you encoded them. Which means that you will get troubles if your odbc client ends up sending bytestrings (an encoded string) to the server instead of sending unicode data: if you do so, the server will use a predefined encoding (that was my question: what encoding the server will use? Since it is not guessing, it must be a parameter value), and if the string had been encoded using a different encoding, dzing, data will get corrupted.
It's exactly similar as doing in Python:
uni = u'Hey my name is André'
in_utf8 = uni.encode('utf-8')
# send the utf-8 data to server
# send(in_utf8)
# on server side
# server receives it. But server is Japanese.
# So the server treats the data with the National charset, shift-jis:
some_string = in_utf8 # some_string = receive()
decoded = some_string.decode('sjis')
Just try it. It's fun. The decoded string is supposed to be "Hey my name is André", but is "Hey my name is Andrﾃｩ". é gets replaced by Japanese ﾃｩ
Hence my suggestion: you need to ensure that pyodbc is able to send directly the data as Unicode. If pyodbc fails to do this, you will get unexpected results.
And I described the problem in the Client to Server way. But the same sort of issues can arise when communicating back from the Server to the Client. If the Client cannot understand Unicode data, you'll likely get into troubles.
FreeTDS handles Unicode for you.
Actually, FreeTDS takes care of things for you and translates all the data to UCS2 unicode. (Source).
Server <--> FreeTDS : UCS2 data
FreeTDS <--> pyodbc : encoded strings, encoded in UTF-8 (from /etc/freetds/freetds.conf)
So I would expect your application to work correctly if you pass UTF-8 data to pyodbc. In fact, as this django-pyodbc ticket states, django-pyodbc communicates in UTF-8 with pyodbc, so you should be fine.
FreeTDS 0.82
However, cramm0 says that FreeTDS 0.82 is not completely bugfree, and that there are significant differences between 0.82 and the official patched 0.82 version that can be found here. You should probably try using the patched FreeTDS
Edited: removed old data, which had nothing to do with FreeTDS but was only relevant to Easysoft commercial odbc driver. Sorry.

I use UCS-2 to interact with SQL Server, not UTF-8.
Correction: I changed the .freetds.conf entry so that the client uses UTF-8
tds version = 8.0
client charset = UTF-8
text size = 32768
Now, bind values work fine for UTF-8 encoded strings.
The driver converts transparently between the UCS-2 used for storage on the dataserver side and the UTF-8 encoded strings given to/taken from the client.
This is with pyodbc 2.0 on Solaris 10 running Python 2.5 and FreeTDS freetds-0.82.1.dev.20081111 and SQL Server 2008
import pyodbc
test_string = u"""Comment ça va ? Très bien ?"""
print type(test_string),repr(test_string)
utf8 = 'utf8:' + test_string.encode('UTF-8')
print type(utf8), repr(utf8)
c = pyodbc.connect('DSN=SA_SQL_SERVER_TEST;UID=XXX;PWD=XXX')
cur = c.cursor()
# This does not work as test_string is not UTF-encoded
try:
cur.execute('INSERT unicode_test(t) VALUES(?)', test_string)
c.commit()
except pyodbc.Error,e:
print e
# This one does:
try:
cur.execute('INSERT unicode_test(t) VALUES(?)', utf8)
c.commit()
except pyodbc.Error,e:
print e
Here is the output from the test table (I had manually put in a bunch of test data via Management Studio)
In [41]: for i in cur.execute('SELECT t FROM unicode_test'):
....: print i
....:
....:
('this is not a banana', )
('\xc3\x85kergatan 24', )
('\xc3\x85kergatan 24', )
('\xe6\xb0\xb4 this is code-point 63CF', )
('Mich\xc3\xa9l', )
('Comment a va ? Trs bien ?', )
('utf8:Comment \xc3\xa7a va ? Tr\xc3\xa8s bien ?', )
I was able to put in some in unicode code points directly into the table from Management Studio by the 'Edit Top 200 rows' dialog and entering the hex digits for the unicode code point and then pressing Alt-X

I had the same problem when trying to bind unicode parameter:
'[HY004] [FreeTDS][SQL Server]Invalid data type (0) (SQLBindParameter)'
I solved it by upgrading freetds to version 0.91.
I use pyodbc 2.1.11. I had to apply this patch to make it work with unicode, otherwise I was getting memory corruption errors occasionally.

Are you sure it's INSERT that's causing problem not reading?
There's a bug open on pyodbc Problem fetching NTEXT and NVARCHAR data.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.