I want to use an API from a game and store the player and clan names in a local database. The names can contain all sorts of characters and emoticons. Here are just a few examples I found:
⭐💎
яαℓαηι
نکل
窝猫
鐵擊道遊隊
❤✖❤♠️♦️♣️✖
I use Python to read the API and write the data into a MySQL database. After that, I want to use the names in a Node.js web application.
What is the best way to encode those characters, and how can I safely store them in the database so that I can display them correctly afterwards?
I tried to encode the strings in python with utf-8:
>>> sample = '蛙喜鄉民CLUB'
>>> sample
'蛙喜鄉民CLUB'
>>> sample = sample.encode('UTF-8')
>>> sample
b'\xe8\x9b\x99\xe5\x96\x9c\xe9\x84\x89\xe6\xb0\x91CLUB'
and stored the encoded string in a MySQL database with the utf8mb4_unicode_ci collation.
When I store the string from above and select it inside mysql workbench it is displayed like this:
蛙喜鄉民CLUB
When I read this string from the database again in python (and store it in db_str) I get:
>>> db_str
èåéæ°CLUB
>>> db_str.encode('UTF-8')
b'\xc3\xa8\xc2\x9b\xc2\x99\xc3\xa5\xc2\x96\xc2\x9c\xc3\xa9\xc2\x84\xc2\x89\xc3\xa6\xc2\xb0\xc2\x91CLUB'
The first output is total gibberish; the second one, encoded with UTF-8, looks mostly like the encoded string from above, but with an added \xc2 or \xc3 before each original byte.
How can I save such strings into mysql, so that I can read them again and display them correctly inside a python script?
Is my database collation utf8mb4_unicode_ci not suitable for such content? Or do I have to use another encoding?
As described by @abarnert in a comment on the question, the problem was that the library used for writing the Unicode strings didn't know that UTF-8 should be used, and therefore encoded the strings wrongly.
After adding charset='utf8mb4' as a parameter to the MySQL connection, the strings get written correctly in the intended encoding.
All I had to change was
conn = MySQLdb.connect(host, user, passwd, db, port)
to
conn = MySQLdb.connect(host, user, passwd, db, port, charset='utf8mb4')
and after that my approach described in the question worked flawlessly.
Edit: after declaring the charset='utf8mb4' parameter on the connection object, it is no longer necessary to encode the strings manually, as the mysqlclient library now does that successfully by itself.
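For reference, a minimal sketch of the whole round trip (host, credentials, and the clans table are placeholders, and the table is assumed to use the utf8mb4 character set; MySQLdb here is the mysqlclient library mentioned above):

import MySQLdb

# charset='utf8mb4' makes the driver exchange UTF-8 with the server,
# so plain str values round-trip without manual encoding.
conn = MySQLdb.connect(host='localhost', user='game', passwd='secret',
                       db='gamedb', port=3306, charset='utf8mb4')
cur = conn.cursor()

name = '蛙喜鄉民CLUB'
cur.execute("INSERT INTO clans (name) VALUES (%s)", (name,))
conn.commit()

cur.execute("SELECT name FROM clans WHERE name = %s", (name,))
print(cur.fetchone()[0])  # prints 蛙喜鄉民CLUB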
I'd like to ask why something works that I found after painful hours of reading, trying to understand, and in the end simply successful trial and error...
I'm on Linux (Ubuntu 13.04, German time formats etc., but English system language). My small Python 3 script connects to an SQLite3 database of the reference manager Zotero. There I read a couple of keys with the goal of exporting files from the Zotero storage directory (probably not important, and as said above, I got it working).
All of this works fine with characters in the ASCII set, but of course there are a lot of international authors in the database, and my code used to fail on non-ASCII author names and paper titles.
Perhaps first some info about the database on command line sqlite3:
sqlite3 zotero-test.sqlite
SQLite version 3.7.15.2 2013-01-09 11:53:05
sqlite> PRAGMA encoding;
UTF-8
Exemplary problematic entry:
sqlite> select * from itemattachments;
317|281|1|application/pdf|5|storage:Müller-Forell_2008_Orbitatumoren.pdf||2|1372357574000|2814ef3ea9c50cce2c32d6fb46b977bb
The correct name would be "storage:Müller-Forell"; Zotero itself decodes this correctly, but SQLite does not (at least it does not output it correctly in my terminal).
Google tells me that "ü" is an incorrectly decoded (or not decoded) Latin-1/ISO-8859-1 "ü".
Reading this database entry from Python 3 with

import sqlite3

connection = sqlite3.connect("zotero-test.sqlite")
cursor = connection.cursor()
cursor.execute("SELECT itemattachments.itemID,itemattachments.sourceItemID,itemattachments.path,items.key FROM itemattachments,items WHERE mimetype=\"application/pdf\" AND items.itemID=itemattachments.itemID")
for pdf_result in cursor:
    print(pdf_result[2])
    print()
    print(pdf_result[2].encode("latin-1").decode("utf-8"))
gives:
storage:Müller-Forell_2008_Orbitatumoren.pdf
storage:Müller-Forell_2008_Orbitatumoren.pdf
The second line is correct, so I got my script working (gosh, how many hours this cost me...).
Can somebody explain to me what this construction of .encode and .decode does? Which one is even executed first?
Thanks for any clues,
Joost
The cursor yields str objects. We run encode() on one to convert it to bytes, and then decode() to turn those bytes back into a str. It sounds like the data in the database is misencoded.
What you're seeing here is UTF-8 data that was misinterpreted as Latin-1 when it was stored in the SQLite database.
The sqlite3 module always returns Unicode strings, so you first have to encode them back into bytes using Latin-1 and then decode those bytes as UTF-8.
They shouldn't have been stored in the db as Latin-1 to begin with.
You are executing encode before decode.
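To see the two steps in isolation, here is a small sketch with a made-up mojibake string:

mojibake = 'MÃ¼ller'                 # what the cursor hands you
raw = mojibake.encode('latin-1')     # undo the wrong decoding: b'M\xc3\xbcller'
print(raw.decode('utf-8'))           # the original text: Müller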
I'm working with a Python script that got a raw string in UTF-8 encoding. First of all I decoded it from UTF-8, then some processing is done, and at the end I encode it back to UTF-8 and insert it into the DB (MySQL), but the characters in the DB are not stored in their real form.
str = '<term>Beiträge</term>'
str = str.decode('utf8')
...
...
...
str = str.encode('utf8')
After that, the string is found in the txt file in its real form, but in the MySQL DB I found it like this:
<term>"Beiträge</term>
Any idea why this happened? :-(
Assuming you are using the MySQLdb library, you need to create connections using the keyword arguments:
use_unicode
If True, text-like columns are returned as unicode objects using the connection's character set. Otherwise, text-like columns are returned as normal strings. Unicode objects will always be encoded to the connection's character set regardless of this setting.

and

charset
If supplied, the connection character set will be changed to this character set (MySQL-4.1 and newer). This implies use_unicode=True.
You should also check the encoding of your db tables.
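For example, you can inspect a table's definition to see its character set (a sketch; conn is assumed to be the MySQLdb connection, and players is a placeholder table name):

cur = conn.cursor()
cur.execute("SHOW CREATE TABLE players")
print(cur.fetchone()[1])  # look for CHARSET=utf8 / utf8mb4 in the CREATE TABLE output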
To make a string a Unicode string you should use the string prefix 'u'. See also http://docs.python.org/reference/lexical_analysis.html#literals
Maybe your example works by just adding the prefix in the initial assignment.
What could be causing this error when I try to insert a foreign character into the database?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201c' in position 0: ordinal not in range(256)
And how do I resolve it?
Thanks!
I ran into this same issue when using the Python MySQLdb module. Since MySQL will let you store just about any binary data you want in a text field regardless of character set, I found my solution here:
Using UTF8 with Python MySQLdb
Edit: Quote from the above URL to satisfy the request in the first comment...
"UnicodeEncodeError:'latin-1' codec can't encode character ..."
This is because MySQLdb normally tries to encode everything to latin-1.
This can be fixed by executing the following commands right after
you've established the connection:
db.set_character_set('utf8')
dbc.execute('SET NAMES utf8;')
dbc.execute('SET CHARACTER SET utf8;')
dbc.execute('SET character_set_connection=utf8;')
"db" is the result of MySQLdb.connect(), and "dbc" is the result of
db.cursor().
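Put together, a minimal sketch (the connection parameters and the items table are placeholders):

import MySQLdb

db = MySQLdb.connect(host='localhost', user='root', passwd='', db='testdb')
db.set_character_set('utf8')
dbc = db.cursor()
dbc.execute('SET NAMES utf8;')
dbc.execute('SET CHARACTER SET utf8;')
dbc.execute('SET character_set_connection=utf8;')
dbc.execute("INSERT INTO items (title) VALUES (%s)", (u'\u201cquoted\u201d',))
db.commit()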
Character U+201C Left Double Quotation Mark is not present in the Latin-1 (ISO-8859-1) encoding.
It is present in code page 1252 (Western European). This is a Windows-specific encoding that is based on ISO-8859-1 but which puts extra characters into the range 0x80-0x9F. Code page 1252 is often confused with ISO-8859-1, and it's an annoying but now-standard web browser behaviour that if you serve your pages as ISO-8859-1, the browser will treat them as cp1252 instead. However, they really are two distinct encodings:
>>> u'He said \u201CHello\u201D'.encode('iso-8859-1')
UnicodeEncodeError
>>> u'He said \u201CHello\u201D'.encode('cp1252')
'He said \x93Hello\x94'
If you are using your database only as a byte store, you can use cp1252 to encode “ and other characters present in the Windows Western code page. But still other Unicode characters which are not present in cp1252 will cause errors.
You can use encode(..., 'ignore') to suppress the errors by getting rid of the characters, but really in this century you should be using UTF-8 in both your database and your pages. This encoding allows any character to be used. You should also ideally tell MySQL you are using UTF-8 strings (by setting the database connection and the collation on string columns), so it can get case-insensitive comparison and sorting right.
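For example, with the string from above (Python 2 syntax, matching the snippets here):

>>> u'He said \u201CHello\u201D'.encode('latin-1', 'ignore')
'He said Hello'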
The best solution is to set MySQL's character set to 'utf8' and to connect as suggested in this comment by KyungHoon Kim (add use_unicode=True and charset="utf8"):

db = MySQLdb.connect(host="localhost", user="root", passwd="", db="testdb", use_unicode=True, charset="utf8")

For details, see the MySQLdb Connection docstring:
class Connection(_mysql.connection):
    """MySQL Database Connection Object"""

    default_cursor = cursors.Cursor

    def __init__(self, *args, **kwargs):
        """
        Create a connection to the database. It is strongly recommended
        that you only use keyword parameters. Consult the MySQL C API
        documentation for more information.

        host
            string, host to connect
        user
            string, user to connect as
        passwd
            string, password to use
        db
            string, database to use
        port
            integer, TCP/IP port to connect to
        unix_socket
            string, location of unix_socket to use
        conv
            conversion dictionary, see MySQLdb.converters
        connect_timeout
            number of seconds to wait before the connection attempt
            fails.
        compress
            if set, compression is enabled
        named_pipe
            if set, a named pipe is used to connect (Windows only)
        init_command
            command which is run once the connection is created
        read_default_file
            file from which default client values are read
        read_default_group
            configuration group to use from the default file
        cursorclass
            class object, used to create cursors (keyword only)
        use_unicode
            If True, text-like columns are returned as unicode objects
            using the connection's character set. Otherwise, text-like
            columns are returned as normal strings. Unicode objects will
            always be encoded to the connection's character set regardless
            of this setting.
        charset
            If supplied, the connection character set will be changed
            to this character set (MySQL-4.1 and newer). This implies
            use_unicode=True.
        sql_mode
            If supplied, the session SQL mode will be changed to this
            setting (MySQL-4.1 and newer). For more details and legal
            values, see the MySQL documentation.
        client_flag
            integer, flags to use or 0
            (see MySQL docs or constants/CLIENTS.py)
        ssl
            dictionary or mapping, contains SSL connection parameters;
            see the MySQL documentation for more details
            (mysql_ssl_set()). If this is set, and the client does not
            support SSL, NotSupportedError will be raised.
        local_infile
            integer, non-zero enables LOAD LOCAL INFILE; zero disables
        autocommit
            If False (default), autocommit is disabled.
            If True, autocommit is enabled.
            If None, autocommit isn't set and server default is used.

        There are a number of undocumented, non-standard methods. See the
        documentation for the MySQL C API for some hints on what they do.
        """
I hope your database is at least UTF-8. Then you will need to run yourstring.encode('utf-8') before you try putting it into the database.
Use the snippet below to strip the accents from the text (i.e. reduce accented Latin characters to their unaccented base characters):

import unicodedata

def strip_accents(text):
    # NFKD decomposition splits accented characters into base character
    # plus combining mark; category 'Mn' (Mark, nonspacing) filters the
    # combining marks out.
    return "".join(char for char in unicodedata.normalize('NFKD', text)
                   if unicodedata.category(char) != 'Mn')

strip_accents('áéíñóúü')
output:
'aeinouu'
You are trying to store a Unicode codepoint \u201c using an encoding ISO-8859-1 / Latin-1 that can't describe that codepoint. Either you might need to alter the database to use utf-8, and store the string data using an appropriate encoding, or you might want to sanitise your inputs prior to storing the content; i.e. using something like Sam Ruby's excellent i18n guide. That talks about the issues that windows-1252 can cause, and suggests how to process it, plus links to sample code!
SQLAlchemy users can simply specify their field as convert_unicode=True.
Example:
sqlalchemy.String(1000, convert_unicode=True)
SQLAlchemy will simply accept unicode objects and return them back, handling the encoding itself.
Docs
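A minimal sketch (note this applies to older SQLAlchemy releases; convert_unicode was deprecated and later removed, as recent versions handle Unicode automatically; the players table is a placeholder for illustration):

from sqlalchemy import Column, Integer, MetaData, String, Table

metadata = MetaData()
players = Table('players', metadata,
                Column('id', Integer, primary_key=True),
                Column('name', String(1000, convert_unicode=True)))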
Latin-1 (aka ISO 8859-1) is a single octet character encoding scheme, and you can't fit \u201c (“) into a byte.
Did you mean to use UTF-8 encoding?
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 106: ordinal not in range(256)
Solution 1:
\u2013 - Google the character's meaning to identify which character is actually causing this error. Then you can replace that specific character in the string with some other character that's part of the encoding you are using.
Solution 2:
Change the string's encoding to one that includes all the characters of your string; then you can print that string and it will work just fine.
The code below, borrowed from @bobince's answer above, changes the encoding of the string:
u'He said \u201CHello\u201D'.encode('cp1252')
The latest version of mysql.connector has only
db.set_charset_collation('utf8', 'utf8_general_ci')
and NOT
db.set_character_set('utf8')  # this method is not available
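For example (connection parameters are placeholders):

import mysql.connector

db = mysql.connector.connect(host='localhost', user='root',
                             password='', database='testdb')
db.set_charset_collation('utf8', 'utf8_general_ci')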
I ran into the same problem when I was using PyMySQL. I checked the package version; it was 0.7.9.
Then I uninstalled it and reinstalled PyMySQL 1.0.2, and the issue was solved.
pip uninstall PyMySQL
pip install PyMySQL
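With a current PyMySQL you can also pass the character set directly when connecting (a sketch; credentials and database name are placeholders):

import pymysql

conn = pymysql.connect(host='localhost', user='root', password='',
                       database='testdb', charset='utf8mb4')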
Python: You will need to add
# -*- coding: UTF-8 -*-
as the first line of the Python file, and then call .encode('ascii', 'xmlcharrefreplace') on the text. This will replace all non-ASCII characters with their XML character references.
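For example (Python 2):

>>> u'\u201cHello\u201d'.encode('ascii', 'xmlcharrefreplace')
'&#8220;Hello&#8221;'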
I have a Python WSGI script that attempts to extract RSS items that are posted to it and store the RSS in an SQLite3 db. I am using flup as the WSGIServer.
To obtain the posted content:
postData = environ["wsgi.input"].read(int(environ["CONTENT_LENGTH"]))
To attempt to store in the db:
from pysqlite2 import dbapi2 as sqlite
ldb = sqlite.connect("/var/vhost/mysite.com/db/rssharvested.db")
lcursor = ldb.cursor()
lcursor.execute("INSERT into rss(data) VALUES(?)", (postData,))
This results in only the first few characters of the rss being stored in the record:
ÿþ<
I believe the initial chars are the BOM of the rss.
I have tried every permutation I could think of, including first encoding the RSS as UTF-8 and then attempting to store it, but the results were the same. I could not decode it, because some characters could not be represented as Unicode.
Running python 2.5.2
sqlite 3.5.7
Thanks in advance for any insight into this problem.
Here is a sample of the initial data contained in postData as modified by the repr function, written to a file and viewed with less:
'\xef\xbb\xbf
Thanks for all the replies! Very helpful.
The sample I submitted didn't make it through the Stack Overflow HTML filters; I will try again, converting less-than and greater-than signs to entities (preview indicates this works).
\xef\xbb\xbf<?xml version="1.0" encoding="utf-16"?><rss xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><channel><item d3p1:size="0" xsi:type="tFileItem" xmlns:d3p1="http://htinc.com/opensearch-ex/1.0/">
Regarding the insertion encoding - in any decent database API, you should insert unicode strings and unicode strings only.
For the reading and parsing bit, I'd recommend Mark Pilgrim's Feed Parser. It properly handles the BOM, and the license allows commercial use. This may be a bit too heavy-handed if you are not doing any actual parsing on the RSS data.
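A minimal sketch with that library (assuming it is installed as the feedparser package):

import feedparser

feed = feedparser.parse(postData)   # accepts a byte string; detects BOM/encoding itself
titles = [entry.title for entry in feed.entries]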
Are you sure your incoming data are encoded as UTF-16 (otherwise known as UCS-2)?
UTF-16 encoded Unicode strings typically include lots of NUL bytes (certainly for all characters that also exist in ASCII), so UTF-16 data can hardly be stored in environment variables (env vars in POSIX are NUL-terminated).
Please provide samples of the postData variable contents. Output them using repr().
Until then, the solid advice is: in all DB interactions, your strings on the Python side should be unicode strings; the DB interface should take care of all translations/encodings/decodings necessary.
Before the SQL insertion you should convert the string to a Unicode string. If a UnicodeError exception is raised, then encode the string with string.encode("utf-8").
Or you can auto-detect the encoding and decode it according to the detected scheme: Auto detect encoding
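For the auto-detection route, a minimal sketch assuming the chardet package is installed:

import chardet

guess = chardet.detect(postData)          # e.g. {'encoding': 'UTF-16', 'confidence': 0.99, ...}
text = postData.decode(guess['encoding'])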