Character encoding and decoding in Python with MySQL

Character encoding and decoding in Python with MySQL - python

For query:
SHOW VARIABLES LIKE 'char%';
MySQL Database returns:
character_set_client latin1
character_set_connection latin1
character_set_database latin1
character_set_filesystem binary
character_set_results latin1
character_set_server latin1
character_set_system utf8
character_sets_dir /usr/local/mysql-5.7.27-macos10.14-x86_64/share/charsets/
In my Python script:
conn = get_database_connection()
conn.setdecoding(pyodbc.SQL_CHAR, encoding='latin1')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='latin1')
For one of the columns that has following value:
N’a pas
Python returns:
N?a pas
Between N and a, There is a star shaped question-mark. How do I read it as is? What's the best way to handle it? I have been reading about converting my db to utf-8 but that seems like a long shot with a good chance of breaking other things. Is there a more efficient way to do it?
At some of the places in code, I have done :
value = value.encode('utf-8', 'ignore').decode('utf-8')
to handle utf-8 data like accented characters but apostrophe did not get handled with the same and I ended up with ? instead of '

Converting the database to UTF-8 is better for the long run, but risky because you may break other things like you say. What you can do is change the database connection encoding to UTF-8. That way you get UTF-8 encoded strings out of the database, without having changed how the data is actually stored.
conn.setdecoding(pyodbc.SQL_CHAR, encoding='utf8')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf8')
If that seems too risky, but you could consider having two separate database connections, the original and one in utf8, and migrate the app to using utf8 little by little, as you have time to test.
If even that seems too risky, maybe try using a character encoding that's more similar to mysql's version of latin1. MySQL's "latin1" is actually an extended version of cp1252 encoding, which itself is a Microsoft extension of the "standard latin1" that's used in Python (among others).
conn.setdecoding(pyodbc.SQL_CHAR, encoding='cp1252')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='cp1252')

Don't use any form of encoding/decoding; it only complicates your code and hides more errors. In fact, you may be trying to "make two wrongs make a right".
Go with utf8 (or utf8mb4).
Notes on "question mark": Trouble with UTF-8 characters; what I see is not what I stored
Notes on Python: http://mysql.rjweb.org/doc.php/charcoll#python

Related

Python MySQL CSV export to json strange encoding

I received a csv file exported from a MySQL database (I think the encoding is latin1 since the language is spanish). Unfortunately the encoding is wrong and I cannot process it at all. If I use file:
$ file -I file.csv
file.csv: text/plain; charset=unknown-8bit
I have tried to read the file in python and convert it to utf-8 like:
r.decode('latin-1').encode("utf-8")
or using mysql_latin1_codec:
r.decode('mysql_latin1').encode('UTF-8')
I am trying to transform the data into json objects. The error comes when I save the file:
'UnicodeEncodeError: 'ascii' codec can't encode characters in position'
Do you know how can I convert it to normal utf-8 chars? Or how can I convert data to a valid json? Thanks!!

I got really good results by using pandas dataframe from Continuum Analytics.
You coud do something like:
import pandas as pd
from pandas import *
con='Your database connection credentials user, password, host, database to use'
data=pd.read_sql_query('SELECT * FROM YOUR TABLE',conn=con)
Then you could do:
data.to_csv('path_with_file_name')
or to convert to JSON:
data.to_json(orient='records')
or if you prefer to customize your json format see the documentation here:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html

Have you tried using the codecs module?:
import codecs
....
codecs.EncodedFile(r, 'latin1').reader.read()
I remember having a similar issue a while back and the answer was something to do with how encoding was done prior to Python 3. Codecs seems to handle this problem relatively elegantly.
As coder mentioned in the question comments, it's difficult to pinpoint the problem without being able to reproduce it so I may be barking up the wrong tree.

You probably have two problems. But let's back off... We can't tell whether the text was imported incorrectly, exported incorrectly, or merely displayed in a goofy way.
First, I am going to discuss "importing"...
Do not try to alter the encoding. Instead live with the encoding. But first, figure out what the encoding is. It could be latin1 or it could be utf8. (Or any of lots of less likely charsets.)
Find out the hex for the incoming file. In Python, the code is something like this for dumping hex (etc) for string u:
for i, c in enumerate(u):
print i, '%04x' % ord(c), unicodedata.category(c),
print unicodedata.name(c)
You can go here to see a list of hex values for all the latin1 characters, together with the utf8 hex. For example, ó is latin1 F3 or utf8 C2B3.
Now, armed with knowing the encoding, tell MySQL that.
LOAD DATA INFILE ...
...
CHARACTER SET utf8 -- or latin1
...;
Meanwhile, it does not matter what CHARACTER SET ... the table or column is defined to be; mysql will transcode if necessary. All Spanish characters are available in latin1 and utf8.
Go to this Q&A .
I suggested that you have two errors, one is the "black diamond" case mentioned there; there other is something else. But... Follow the "Best Practice" mentioned.
Back to you question of "exporting"...
Again, you need to check the hex of the output file. Again it does not matter whether it is latin1 or utf8. However... If the hex is C383C2B3 for simply ó, you have "double encoding". If you have that, check to see that you have removed any manual conversion function calls, and simply told MySQL what's what.
Here are some more utf8+Python tips you might need.
If you need more help, follow the text step-by-step. Show us the code used to move/convert it at each step, and show us the HEX at each step.

Python with MySql unicode problems

I need to call MySQL stored procedure from my python script. As one of parameters I'm passing a unicode string (Russian language), but I get an error;
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256)
My script:
self.db=MySQLdb.connect("localhost", "usr", "pass", "dbName")
self.cursor=self.db.cursor()
args=("какой-то текст") #this is string in russian
self.cursor.callproc('pr_MyProc', args)
self.cursor.execute('SELECT #_pr_MyProc_2') #getting result from sp
result=self.cursor.fetchone()
self.db.commit()
I've read that setting charset='utf8' shuld resolve this problem, but when I use string:
self.db=MySQLdb.connect("localhost", "usr", "pass", "dbName", charset='utf8')
This gives me another error;
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd1' in position 20: surrogates not allowed
Also I've trying to set parametr use_unicode=True, that's not working.

More things to check on:
http://mysql.rjweb.org/doc.php/charcoll#python
Likely items:
Start code file with # -*- coding: utf-8 -*- -- (for literals in code)
Literals should be u'...'
Can you extract the HEX? какой-то текст should be this in utf8: D0BA D0B0 D0BA D0BE D0B9 2D D182 D0BE D182 20 D0B5 D0BA D181 D182

Here are some thoughts. Maybe not a response. I've been playing with python/mysql/utf-8/unicode in the past and this is the things i remember:
Looking at Saltstack mysql module's comment :
https://github.com/saltstack/salt/blob/develop/salt/modules/mysql.py#L314-L322
# MySQLdb states that this is required for charset usage
# but in fact it's more than it's internally activated
# when charset is used, activating use_unicode here would
# retrieve utf8 strings as unicode() objects in salt
# and we do not want that.
#_connarg('connection_use_unicode', 'use_unicode')
connargs['use_unicode'] = False
_connarg('connection_charset', 'charset')
We see that to avoid altering the result string the use_unicode is set to False, while the charset (which could be utf-8) is set as a parameter. use_unicode is more a 'request' to get responses as unicode strings.
You can check real usage in the tests, here:
https://github.com/saltstack/salt/blob/develop/tests/integration/modules/test_mysql.py#L311-L361 with a database named '標準語'.
Now about the message UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd1' **. You are using **unicode but you tell the module it is utf-8. It is not utf-8 until you encode your unicode string in utf-8.
Maybe you should try with:
args=(u"какой-то текст".encode('utf-8'))
At least in python3 this is required, because your "какой-то текст" is not in utf-8 by default.

The MySQLdb module is not compatible with python 3. That might be why you are getting problems. I would advise to use a different connector, like PyMySQL or mysqlclient.
Related: 23376103.

Maybe you can reload your sys in utf-8 and try to decode the string into utf-8 as following :
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
...
stringUtf8 = u''.join(string_original).decode('utf-8')

I had a similar problem very recently but with PostgreSQL. After trying tons of suggestions from SO/ internet, I realized the issue was with my database. I had to drop my database and reinstall Postgres, because for some reason it was not allowing me to change the database's default collation. I was in a hurry so couldn't find a better solution, but would recommend the same, since I was only starting my application in the deployment environment.
All the best.

What's your database's charset?
use :
show variables like "characetr%";
or see your database's charset

I see here two problems.
You have unicode but you try to define it as utf-8 by setting parameter "charset". You should first encode your unicode to utf-8 or another encoding system.
If it however doesn't work, try to do so with init_command='SET NAMES UTF8' parameter.
So it will look like:
conn = MySQLdb.connect(charset='utf8', init_command='SET NAMES UTF8')
You can try also this:
cursor = db.cursor()
cursor.execute("SET NAMES UTF8;")

I encountered a similar issue, which was caused by invalid utf-8 data in the database; it seems that MySQL doesn't care about that, but Python does, because it's following the UTF-8 spec, which says that:
surrogate pairs are not allowed in utf-8
unpaired surrogates are not allowed in utf-8
If you want to "make it work", you'll have to intercept the MySQL packet and use your own converter which will perform ad-hoc replacements.
Here's one way to "handle" invalid data containing surrogates:
def borked_utf8_decode(data):
"""
Work around input with unpaired surrogates or surrogate pairs,
replacing by XML char refs: look for "&#\d+;" after.
"""
return data.decode("utf-8", "surrogatepass") \
.encode("utf-8", "xmlcharrefreplace") \
.decode("utf-8")
Note that the proper way to handle that is context-dependent, but there are some common replacement scenarios, like this one.
And here's one way of plugging this into pymysql (another way is to monkey-patch field processing, see eg. https://github.com/PyMySQL/PyMySQL/issues/631):
import pymysql.converters
# use this in your connection
pymysql_use_unicode = False
conversions = pymysql.converters.conversions
conversions[pymysql.converters.FIELD_TYPE.STRING] = borked_utf8_decode
conversions[pymysql.converters.FIELD_TYPE.VAR_STRING] = borked_utf8_decode
conversions[pymysql.converters.FIELD_TYPE.VARCHAR] = borked_utf8_decode

Curious about unicode / string encoding in Python 3

I'd like to ask why something works which I have found after painful hours of reading/trying to understand and in the end simply succesfull trial-and-error...
I'm on Linux (Ubuntu 13.04, German time formats etc., but english system language). My small python 3 script connects to a sqlite3 database of the reference manager Zotero. There I read a couple of keys with the goal of exporting files from the zotero storage directory (probably not important, and as said above, got it working).
All of this works fine with characters in the ascii set, but of course there are a lot of international authors in the database and my code used to fail on non-ascii authors/paper titles.
Perhaps first some info about the database on command line sqlite3:
sqlite3 zotero-test.sqlite
SQLite version 3.7.15.2 2013-01-09 11:53:05
sqlite> PRAGMA encoding;
UTF-8
Exemplary problematic entry:
sqlite> select * from itemattachments;
317|281|1|application/pdf|5|storage:MÃ¼ller-Forell_2008_Orbitatumoren.pdf||2|1372357574000|2814ef3ea9c50cce2c32d6fb46b977bb
The correct name would be "storage:Müller-Forrell"; Zotero itself decodes this correctly, but SQLIte does not (at least dos not output it correctly in my terminal).
Google tells me that "Ã¼" is a somehow incorrectly or not decoded latin-1/8859-1 "ü".
Reading this database entry in from python3 with
connection = sqlite3.connect("zotero-test.sqlite")`
cursor = connection.cursor()`
cursor.execute("SELECT itemattachments.itemID,itemattachments.sourceItemID,itemattachments.path,items.key FROM itemattachments,items WHERE mimetype=\"application/pdf\" AND items.itemID=itemattachments.itemID")
for pdf_result in cursor:
print(pdf_result[2])
print()
print(pdf_result[2].encode("latin-1").decode("utf-8"))
gives:
storage:MÃ¼ller-Forell_2008_Orbitatumoren.pdf
storage:Müller-Forell_2008_Orbitatumoren.pdf
, the second being correct, so I got my script working (gosh how many hours this cost me...)
Can somebody explain to me what this construction of .encode and .decode does? Which one is even executed first?
Thanks for any clues,
Joost

The cursor yields strs. We run encode() on it to convert it to a bytes, and then decode it back into a str. It sounds like the data in the database is misencoded.

What you're seeing here is UTF8 data encoded in latin-1 stored in the SQLite database.
The sqlite module always returns unicode strings, so you first have to encode them into a unicode equivalent of latin-1 and then decode them as UTF8.
They shouldn't have been stored in the db as latin-1 to begin with.
You are executing encode before decode.

UnicodeEncodeError: 'latin-1' codec can't encode character

What could be causing this error when I try to insert a foreign character into the database?
>>UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201c' in position 0: ordinal not in range(256)
And how do I resolve it?
Thanks!

I ran into this same issue when using the Python MySQLdb module. Since MySQL will let you store just about any binary data you want in a text field regardless of character set, I found my solution here:
Using UTF8 with Python MySQLdb
Edit: Quote from the above URL to satisfy the request in the first comment...
"UnicodeEncodeError:'latin-1' codec can't encode character ..."
This is because MySQLdb normally tries to encode everythin to latin-1.
This can be fixed by executing the following commands right after
you've etablished the connection:
db.set_character_set('utf8')
dbc.execute('SET NAMES utf8;')
dbc.execute('SET CHARACTER SET utf8;')
dbc.execute('SET character_set_connection=utf8;')
"db" is the result of MySQLdb.connect(), and "dbc" is the result of
db.cursor().

Character U+201C Left Double Quotation Mark is not present in the Latin-1 (ISO-8859-1) encoding.
It is present in code page 1252 (Western European). This is a Windows-specific encoding that is based on ISO-8859-1 but which puts extra characters into the range 0x80-0x9F. Code page 1252 is often confused with ISO-8859-1, and it's an annoying but now-standard web browser behaviour that if you serve your pages as ISO-8859-1, the browser will treat them as cp1252 instead. However, they really are two distinct encodings:
>>> u'He said \u201CHello\u201D'.encode('iso-8859-1')
UnicodeEncodeError
>>> u'He said \u201CHello\u201D'.encode('cp1252')
'He said \x93Hello\x94'
If you are using your database only as a byte store, you can use cp1252 to encode “ and other characters present in the Windows Western code page. But still other Unicode characters which are not present in cp1252 will cause errors.
You can use encode(..., 'ignore') to suppress the errors by getting rid of the characters, but really in this century you should be using UTF-8 in both your database and your pages. This encoding allows any character to be used. You should also ideally tell MySQL you are using UTF-8 strings (by setting the database connection and the collation on string columns), so it can get case-insensitive comparison and sorting right.

The best solution is
set mysql's charset to 'utf-8'
do like this comment(add use_unicode=True and charset="utf8")
db = MySQLdb.connect(host="localhost", user = "root", passwd = "", db = "testdb", use_unicode=True, charset="utf8") – KyungHoon Kim Mar
13 '14 at 17:04
detail see :
class Connection(_mysql.connection):
"""MySQL Database Connection Object"""
default_cursor = cursors.Cursor
def __init__(self, *args, **kwargs):
"""
Create a connection to the database. It is strongly recommended
that you only use keyword parameters. Consult the MySQL C API
documentation for more information.
host
string, host to connect
user
string, user to connect as
passwd
string, password to use
db
string, database to use
port
integer, TCP/IP port to connect to
unix_socket
string, location of unix_socket to use
conv
conversion dictionary, see MySQLdb.converters
connect_timeout
number of seconds to wait before the connection attempt
fails.
compress
if set, compression is enabled
named_pipe
if set, a named pipe is used to connect (Windows only)
init_command
command which is run once the connection is created
read_default_file
file from which default client values are read
read_default_group
configuration group to use from the default file
cursorclass
class object, used to create cursors (keyword only)
use_unicode
If True, text-like columns are returned as unicode objects
using the connection's character set. Otherwise, text-like
columns are returned as strings. columns are returned as
normal strings. Unicode objects will always be encoded to
the connection's character set regardless of this setting.
charset
If supplied, the connection character set will be changed
to this character set (MySQL-4.1 and newer). This implies
use_unicode=True.
sql_mode
If supplied, the session SQL mode will be changed to this
setting (MySQL-4.1 and newer). For more details and legal
values, see the MySQL documentation.
client_flag
integer, flags to use or 0
(see MySQL docs or constants/CLIENTS.py)
ssl
dictionary or mapping, contains SSL connection parameters;
see the MySQL documentation for more details
(mysql_ssl_set()). If this is set, and the client does not
support SSL, NotSupportedError will be raised.
local_infile
integer, non-zero enables LOAD LOCAL INFILE; zero disables
autocommit
If False (default), autocommit is disabled.
If True, autocommit is enabled.
If None, autocommit isn't set and server default is used.
There are a number of undocumented, non-standard methods. See the
documentation for the MySQL C API for some hints on what they do.
"""

I hope your database is at least UTF-8. Then you will need to run yourstring.encode('utf-8') before you try putting it into the database.

Use the below snippet to convert the text from Latin to English
import unicodedata
def strip_accents(text):
return "".join(char for char in
unicodedata.normalize('NFKD', text)
if unicodedata.category(char) != 'Mn')
strip_accents('áéíñóúü')
output:
'aeinouu'

You are trying to store a Unicode codepoint \u201c using an encoding ISO-8859-1 / Latin-1 that can't describe that codepoint. Either you might need to alter the database to use utf-8, and store the string data using an appropriate encoding, or you might want to sanitise your inputs prior to storing the content; i.e. using something like Sam Ruby's excellent i18n guide. That talks about the issues that windows-1252 can cause, and suggests how to process it, plus links to sample code!

SQLAlchemy users can simply specify their field as convert_unicode=True.
Example:
sqlalchemy.String(1000, convert_unicode=True)
SQLAlchemy will simply accept unicode objects and return them back, handling the encoding itself.
Docs

Latin-1 (aka ISO 8859-1) is a single octet character encoding scheme, and you can't fit \u201c (“) into a byte.
Did you mean to use UTF-8 encoding?

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 106: ordinal not in range(256)
Solution 1:
\u2013 - google the character meaning to identify what character actually causing this error, Then you can replace that specific character, in the string with some other character, that's part of the encoding you are using.
Solution 2:
Change the string encoding to some encoding which includes all the character of your string. and then you can print that string, it will work just fine.
below code is used to change encoding of the string , borrowed from #bobince
u'He said \u201CHello\u201D'.encode('cp1252')

The latest version of mysql.connector has only
db.set_charset_collation('utf8', 'utf8_general_ci')
and NOT
db.set_character_set('utf8') //This feature is not available

I ran into the same problem when I was using PyMySQL. I checked this package version, it's 0.7.9.
Then I uninstall it and reinstall PyMySQL-1.0.2, the issue is solved.
pip uninstall PyMySQL
pip install PyMySQL

Python: You will need to add
# - * - coding: UTF-8 - * - (remove the spaces around * )
to the first line of the python file. and then add the following to the text to encode: .encode('ascii', 'xmlcharrefreplace'). This will replace all the unicode characters with it's ASCII equivalent.

What is the correct procedure to store a utf-16 encoded rss stream into sqlite3 using python

I have a python sgi script that attempts to extract an rss items that is posted to it and store the rss in a sqlite3 db. I am using flup as the WSGIServer.
To obtain the posted content:
postData = environ["wsgi.input"].read(int(environ["CONTENT_LENGTH"]))
To attempt to store in the db:
from pysqlite2 import dbapi2 as sqlite
ldb = sqlite.connect("/var/vhost/mysite.com/db/rssharvested.db")
lcursor = ldb.cursor()
lcursor.execute("INSERT into rss(data) VALUES(?)", (postData,))
This results in only the first few characters of the rss being stored in the record:
ÿþ<
I believe the initial chars are the BOM of the rss.
I have tried every permutation I could think of including first encoding rss as utf-8 and then attempting to store but the results were the same. I could not decode because some characters could not be represented as unicode.
Running python 2.5.2
sqlite 3.5.7
Thanks in advance for any insight into this problem.
Here is a sample of the initial data contained in postData as modified by the repr function, written to a file and viewed with less:
'\xef\xbb\xbf
Thanks for the all the replies! Very helpful.
The sample I submitted didn't make it through the stackoverflow html filters will try again, converting less and greater than to entities (preview indicates this works).
\xef\xbb\xbf<?xml version="1.0" encoding="utf-16"?><rss xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><channel><item d3p1:size="0" xsi:type="tFileItem" xmlns:d3p1="http://htinc.com/opensearch-ex/1.0/">

Regarding the insertion encoding - in any decent database API, you should insert unicode strings and unicode strings only.
For the reading and parsing bit, I'd recommend Mark Pilgrim's Feed Parser. It properly handles BOM, and the license allows commercial use. This may be a bit too heavy handed if you are not doing any actual parsing on the RSS data.

Are you sure your incoming data are encoded as UTF-16 (otherwise known as UCS-2)?
UTF-16 encoded unicode strings typically include lots of NUL characters (surely for all characters existing in ASCII too), so UTF-16 data hardly can be stored in environment variables (env vars in POSIX are NUL terminated).
Please provide samples of the postData variable contents. Output them using repr().
Until then, the solid advice is: in all DB interactions, your strings on the Python side should be unicode strings; the DB interface should take care of all translations/encodings/decodings necessary.

Before the SQL insertion you should to convert the string to unicode compatible strings. If you raise an UnicodeError exception, then encode the string.encode("utf-8").
Or , you can autodetect encoding and encode it , on his encode schema. Auto detect encoding

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.