UnicodeEncodeError: 'latin-1' codec can't encode character - python

What could be causing this error when I try to insert a foreign character into the database?
>>UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201c' in position 0: ordinal not in range(256)
And how do I resolve it?
Thanks!

I ran into this same issue when using the Python MySQLdb module. Since MySQL will let you store just about any binary data you want in a text field regardless of character set, I found my solution here:
Using UTF8 with Python MySQLdb
Edit: Quote from the above URL to satisfy the request in the first comment...
"UnicodeEncodeError:'latin-1' codec can't encode character ..."
This is because MySQLdb normally tries to encode everythin to latin-1.
This can be fixed by executing the following commands right after
you've etablished the connection:
db.set_character_set('utf8')
dbc.execute('SET NAMES utf8;')
dbc.execute('SET CHARACTER SET utf8;')
dbc.execute('SET character_set_connection=utf8;')
"db" is the result of MySQLdb.connect(), and "dbc" is the result of
db.cursor().

Character U+201C Left Double Quotation Mark is not present in the Latin-1 (ISO-8859-1) encoding.
It is present in code page 1252 (Western European). This is a Windows-specific encoding that is based on ISO-8859-1 but which puts extra characters into the range 0x80-0x9F. Code page 1252 is often confused with ISO-8859-1, and it's an annoying but now-standard web browser behaviour that if you serve your pages as ISO-8859-1, the browser will treat them as cp1252 instead. However, they really are two distinct encodings:
>>> u'He said \u201CHello\u201D'.encode('iso-8859-1')
UnicodeEncodeError
>>> u'He said \u201CHello\u201D'.encode('cp1252')
'He said \x93Hello\x94'
If you are using your database only as a byte store, you can use cp1252 to encode “ and other characters present in the Windows Western code page. But still other Unicode characters which are not present in cp1252 will cause errors.
You can use encode(..., 'ignore') to suppress the errors by getting rid of the characters, but really in this century you should be using UTF-8 in both your database and your pages. This encoding allows any character to be used. You should also ideally tell MySQL you are using UTF-8 strings (by setting the database connection and the collation on string columns), so it can get case-insensitive comparison and sorting right.

The best solution is
set mysql's charset to 'utf-8'
do like this comment(add use_unicode=True and charset="utf8")
db = MySQLdb.connect(host="localhost", user = "root", passwd = "", db = "testdb", use_unicode=True, charset="utf8") – KyungHoon Kim Mar
13 '14 at 17:04
detail see :
class Connection(_mysql.connection):
"""MySQL Database Connection Object"""
default_cursor = cursors.Cursor
def __init__(self, *args, **kwargs):
"""
Create a connection to the database. It is strongly recommended
that you only use keyword parameters. Consult the MySQL C API
documentation for more information.
host
string, host to connect
user
string, user to connect as
passwd
string, password to use
db
string, database to use
port
integer, TCP/IP port to connect to
unix_socket
string, location of unix_socket to use
conv
conversion dictionary, see MySQLdb.converters
connect_timeout
number of seconds to wait before the connection attempt
fails.
compress
if set, compression is enabled
named_pipe
if set, a named pipe is used to connect (Windows only)
init_command
command which is run once the connection is created
read_default_file
file from which default client values are read
read_default_group
configuration group to use from the default file
cursorclass
class object, used to create cursors (keyword only)
use_unicode
If True, text-like columns are returned as unicode objects
using the connection's character set. Otherwise, text-like
columns are returned as strings. columns are returned as
normal strings. Unicode objects will always be encoded to
the connection's character set regardless of this setting.
charset
If supplied, the connection character set will be changed
to this character set (MySQL-4.1 and newer). This implies
use_unicode=True.
sql_mode
If supplied, the session SQL mode will be changed to this
setting (MySQL-4.1 and newer). For more details and legal
values, see the MySQL documentation.
client_flag
integer, flags to use or 0
(see MySQL docs or constants/CLIENTS.py)
ssl
dictionary or mapping, contains SSL connection parameters;
see the MySQL documentation for more details
(mysql_ssl_set()). If this is set, and the client does not
support SSL, NotSupportedError will be raised.
local_infile
integer, non-zero enables LOAD LOCAL INFILE; zero disables
autocommit
If False (default), autocommit is disabled.
If True, autocommit is enabled.
If None, autocommit isn't set and server default is used.
There are a number of undocumented, non-standard methods. See the
documentation for the MySQL C API for some hints on what they do.
"""

I hope your database is at least UTF-8. Then you will need to run yourstring.encode('utf-8') before you try putting it into the database.

Use the below snippet to convert the text from Latin to English
import unicodedata
def strip_accents(text):
return "".join(char for char in
unicodedata.normalize('NFKD', text)
if unicodedata.category(char) != 'Mn')
strip_accents('áéíñóúü')
output:
'aeinouu'

You are trying to store a Unicode codepoint \u201c using an encoding ISO-8859-1 / Latin-1 that can't describe that codepoint. Either you might need to alter the database to use utf-8, and store the string data using an appropriate encoding, or you might want to sanitise your inputs prior to storing the content; i.e. using something like Sam Ruby's excellent i18n guide. That talks about the issues that windows-1252 can cause, and suggests how to process it, plus links to sample code!

SQLAlchemy users can simply specify their field as convert_unicode=True.
Example:
sqlalchemy.String(1000, convert_unicode=True)
SQLAlchemy will simply accept unicode objects and return them back, handling the encoding itself.
Docs

Latin-1 (aka ISO 8859-1) is a single octet character encoding scheme, and you can't fit \u201c (“) into a byte.
Did you mean to use UTF-8 encoding?

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 106: ordinal not in range(256)
Solution 1:
\u2013 - google the character meaning to identify what character actually causing this error, Then you can replace that specific character, in the string with some other character, that's part of the encoding you are using.
Solution 2:
Change the string encoding to some encoding which includes all the character of your string. and then you can print that string, it will work just fine.
below code is used to change encoding of the string , borrowed from #bobince
u'He said \u201CHello\u201D'.encode('cp1252')

The latest version of mysql.connector has only
db.set_charset_collation('utf8', 'utf8_general_ci')
and NOT
db.set_character_set('utf8') //This feature is not available

I ran into the same problem when I was using PyMySQL. I checked this package version, it's 0.7.9.
Then I uninstall it and reinstall PyMySQL-1.0.2, the issue is solved.
pip uninstall PyMySQL
pip install PyMySQL

Python: You will need to add
# - * - coding: UTF-8 - * - (remove the spaces around * )
to the first line of the python file. and then add the following to the text to encode: .encode('ascii', 'xmlcharrefreplace'). This will replace all the unicode characters with it's ASCII equivalent.

Related

How to encode international strings with emoticons and special characters for storing in database

I want to use a API from a game and store the player and clan names in a local database. The names can contain all sorts of characters and emoticons. Here are just a few examples I found:
⭐💎
яαℓαηι
نکل
窝猫
鐵擊道遊隊
❤✖❤♠️♦️♣️✖
I use python for reading the api and write it into a mysql database. After that, I want to use the names on a Node.js web application.
What is the best way to encode those characters and how can I savely store them in the database, so that I can display them correcly afterwards?
I tried to encode the strings in python with utf-8:
>>> sample = '蛙喜鄉民CLUB'
>>> sample
'蛙喜鄉民CLUB'
>>> sample = sample.encode('UTF-8')
>>> sample
b'\xe8\x9b\x99\xe5\x96\x9c\xe9\x84\x89\xe6\xb0\x91CLUB'
and storing the encoded string in a mysql database with utf8mb4_unicode_ci character set.
When I store the string from above and select it inside mysql workbench it is displayed like this:
蛙喜鄉民CLUB
When I read this string from the database again in python (and store it in db_str) I get:
>>> db_str
èåéæ°CLUB
>>> db_str.encode('UTF-8')
b'\xc3\xa8\xc2\x9b\xc2\x99\xc3\xa5\xc2\x96\xc2\x9c\xc3\xa9\xc2\x84\xc2\x89\xc3\xa6\xc2\xb0\xc2\x91CLUB'
The first output is total gibberish, the second one with utf-8 looks mostly like the encoded string from above, but with added \xc2 or \xc3 between each byte.
How can I save such strings into mysql, so that I can read them again and display them correctly inside a python script?
Is my database collation utf8mb4_unicode_ci not suitable for such content? Or do I have to use another encoding?
As described by #abarnert in a comment to the question, the problem was that the library used for written the unicode strings didn't know that utf-8 should be used and therefor encoded the strings wrong.
After adding charset='utf8mb4' as parameter to the mysql connection the string get written correctly in the intended encoding.
All I had to change was
conn = MySQLdb.connect(host, user, pass, db, port)
to
conn = MySQLdb.connect(host, user, pass, db, port, charset='utf8mb4')
and after that my approach described in the question worked flawlessly.
edit: after declaring the charset='utf8mb4' parameter on the connection object it is no longer necessary to encode the strings, as that gets now already successfully done by the mysqlclient library.

Python with MySql unicode problems

I need to call MySQL stored procedure from my python script. As one of parameters I'm passing a unicode string (Russian language), but I get an error;
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256)
My script:
self.db=MySQLdb.connect("localhost", "usr", "pass", "dbName")
self.cursor=self.db.cursor()
args=("какой-то текст") #this is string in russian
self.cursor.callproc('pr_MyProc', args)
self.cursor.execute('SELECT #_pr_MyProc_2') #getting result from sp
result=self.cursor.fetchone()
self.db.commit()
I've read that setting charset='utf8' shuld resolve this problem, but when I use string:
self.db=MySQLdb.connect("localhost", "usr", "pass", "dbName", charset='utf8')
This gives me another error;
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd1' in position 20: surrogates not allowed
Also I've trying to set parametr use_unicode=True, that's not working.
More things to check on:
http://mysql.rjweb.org/doc.php/charcoll#python
Likely items:
Start code file with # -*- coding: utf-8 -*- -- (for literals in code)
Literals should be u'...'
Can you extract the HEX? какой-то текст should be this in utf8: D0BA D0B0 D0BA D0BE D0B9 2D D182 D0BE D182 20 D0B5 D0BA D181 D182
Here are some thoughts. Maybe not a response. I've been playing with python/mysql/utf-8/unicode in the past and this is the things i remember:
Looking at Saltstack mysql module's comment :
https://github.com/saltstack/salt/blob/develop/salt/modules/mysql.py#L314-L322
# MySQLdb states that this is required for charset usage
# but in fact it's more than it's internally activated
# when charset is used, activating use_unicode here would
# retrieve utf8 strings as unicode() objects in salt
# and we do not want that.
#_connarg('connection_use_unicode', 'use_unicode')
connargs['use_unicode'] = False
_connarg('connection_charset', 'charset')
We see that to avoid altering the result string the use_unicode is set to False, while the charset (which could be utf-8) is set as a parameter. use_unicode is more a 'request' to get responses as unicode strings.
You can check real usage in the tests, here:
https://github.com/saltstack/salt/blob/develop/tests/integration/modules/test_mysql.py#L311-L361 with a database named '標準語'.
Now about the message UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd1' **. You are using **unicode but you tell the module it is utf-8. It is not utf-8 until you encode your unicode string in utf-8.
Maybe you should try with:
args=(u"какой-то текст".encode('utf-8'))
At least in python3 this is required, because your "какой-то текст" is not in utf-8 by default.
The MySQLdb module is not compatible with python 3. That might be why you are getting problems. I would advise to use a different connector, like PyMySQL or mysqlclient.
Related: 23376103.
Maybe you can reload your sys in utf-8 and try to decode the string into utf-8 as following :
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
...
stringUtf8 = u''.join(string_original).decode('utf-8')
I had a similar problem very recently but with PostgreSQL. After trying tons of suggestions from SO/ internet, I realized the issue was with my database. I had to drop my database and reinstall Postgres, because for some reason it was not allowing me to change the database's default collation. I was in a hurry so couldn't find a better solution, but would recommend the same, since I was only starting my application in the deployment environment.
All the best.
What's your database's charset?
use :
show variables like "characetr%";
or see your database's charset
I see here two problems.
You have unicode but you try to define it as utf-8 by setting parameter "charset". You should first encode your unicode to utf-8 or another encoding system.
If it however doesn't work, try to do so with init_command='SET NAMES UTF8' parameter.
So it will look like:
conn = MySQLdb.connect(charset='utf8', init_command='SET NAMES UTF8')
You can try also this:
cursor = db.cursor()
cursor.execute("SET NAMES UTF8;")
I encountered a similar issue, which was caused by invalid utf-8 data in the database; it seems that MySQL doesn't care about that, but Python does, because it's following the UTF-8 spec, which says that:
surrogate pairs are not allowed in utf-8
unpaired surrogates are not allowed in utf-8
If you want to "make it work", you'll have to intercept the MySQL packet and use your own converter which will perform ad-hoc replacements.
Here's one way to "handle" invalid data containing surrogates:
def borked_utf8_decode(data):
"""
Work around input with unpaired surrogates or surrogate pairs,
replacing by XML char refs: look for "&#\d+;" after.
"""
return data.decode("utf-8", "surrogatepass") \
.encode("utf-8", "xmlcharrefreplace") \
.decode("utf-8")
Note that the proper way to handle that is context-dependent, but there are some common replacement scenarios, like this one.
And here's one way of plugging this into pymysql (another way is to monkey-patch field processing, see eg. https://github.com/PyMySQL/PyMySQL/issues/631):
import pymysql.converters
# use this in your connection
pymysql_use_unicode = False
conversions = pymysql.converters.conversions
conversions[pymysql.converters.FIELD_TYPE.STRING] = borked_utf8_decode
conversions[pymysql.converters.FIELD_TYPE.VAR_STRING] = borked_utf8_decode
conversions[pymysql.converters.FIELD_TYPE.VARCHAR] = borked_utf8_decode

How to encode (utf8mb4) in Python

How do I encode something in ut8mb4 in Python?
I have two sets of data: data I am migrating to my new MySQL database over from Parse, and data going forward (that talks only to my new database). My database is utf8mb4 in order to store emoji and accented letters.
The first set of data only shows up correctly (when emoji and accents are involved) when I have in my python script:
MySQLdb.escape_string(unicode(xstr(data.get('message'))).encode('utf-8'))
and when reading from the MySQL database in PHP:
$row["message"] = utf8_encode($row["message"]);
The second set of data only shows up correctly (when emoji and accents are involved) when I DON'T include the utf8_encode($row["message"]) portion. I am trying to reconcile these so that both sets of data are returned correctly to my iOS app. Please help!
I have struggled myself with the correct exchange of the full range of UTF-8 characters between Python and MySQL for the sake of Emoji and other characters beyond the U+FFFF codepoint.
To be sure that everything worked fine, I had to do the following:
make sure utf8mb4 was used for CHAR, VARCHAR, and TEXT columns in MySQL
enforce UTF-8 in Python
enforce UTF-8 to be used between Python and MySQL
To enforce UTF-8 in Python, add the following line as first or second line of your Python script:
# -*- coding: utf-8 -*-
To enforce UTF-8 between Python and MySQL, setup the MySQL connection as follows:
# Connect to mysql.
dbc = MySQLdb.connect(host='###', user='###', passwd='###', db='###', use_unicode=True)
# Create a cursor.
cursor = dbc.cursor()
# Enforce UTF-8 for the connection.
cursor.execute('SET NAMES utf8mb4')
cursor.execute("SET CHARACTER SET utf8mb4")
cursor.execute("SET character_set_connection=utf8mb4")
# Do database stuff.
# Commit data.
dbc.commit()
# Close cursor and connection.
cursor.close()
dbc.close()
This way, you don't need to use functions such as encode and utf8_encode.
MySQL's utf8mb4 encoding is just standard UTF-8.
They had to add that name however to distinguish it from the broken UTF-8 character set which only supported BMP characters.
In other words, from the Python side you should always encode to UTF-8 when talking to MySQL, but take into account that the database may not be able to handle Unicode codepoints beyond U+FFFF, unless you use utf8mb4 on the MySQL side.
However, generally speaking, you want to avoid manually encoding and decoding, and instead leave it to MySQLdb worry about this. You do this by configuring your connection and your collations to handle Unicode text transparently. For MySQLdb, that means setting charset='utf8mb4':
database = MySQLdb.connect(
host=hostname,
user=username,
passwd=password,
db=databasename,
charset="utf8mb4"
)
Then use normal Python 3 str strings; leave the use_unicode option set to it's default True*.
Note: this handles SET NAMES and SET character_set_connection) for you, there is no need to issue those manually.
* Unless you still use Python 2, then the default is False. Set it to True and use u'...' unicode strings.
use_unicode=True didn't work for me.
My solution
in mysql, change entire database, table and field encoding to utf8mb4
MySQLdb.connect(host='###' [...], charset='utf8'
dbCursor.execute('SET NAMES utf8mb4')
dbCursor.execute("SET CHARACTER SET utf8mb4")
You can also enter the type of code that you want in the following way
mysql.connector.connect(host = '<host>', database = '<db>', user = '<user>', password = '<password>', charset = 'utf8')
The fields inside '<>' are your own details. Instead of 'utf8' you can also write 'utf8mb4' depending on the type of coding your mysqldb wants.

Using Oracle database with the encoding UTF-8

I have an oracle database which is encoded to US-ASCII because when I run the command print db.encoding, US-ASCII is the response.
My problem is that I have some special characters in the database like "ç" and "ã" and when I make some queries, these special characters are returned as question marks (?).
I need to encode the data as UTF-8 but I don't know how. I've already used methods like encode() and unicode() but nothing works.
Can anyone help me please?
Thanks.
I've resolved my problem with the command below before the database connection:
import os
os.environ["NLS_LANG"] = ".UTF8"
You must create the database in Oracle with NATIONAL CHARACTER SET clause. Check this post on how to achieve this.

Python encoding issue

i m working with some python script, got a raw string with UTF8 encoding. first of all i decoded it to utf8 then some processing is done and at the end i encode it back to utf8 and inserted to DB(mysql) but chars in DB are not presented in real format.
str = '<term>Beiträge</term>'
str = str.decode('utf8')
...
...
...
str = str.encode('utf8')
after that string is found in txt file in its real form but in MYSQL_DB, i found it like this
<term>"Beiträge</term>
any idea why this happened? :-(
Assuming you are using the MySQLdb library, you need to create connections using the keyword arguments:
use_unicode
If True, text-like columns are returned as unicode objects using the
connection's character set. Otherwise,
text-like columns are returned as
strings. columns are returned as
normal strings. Unicode objects will
always be encoded to the connection's
character set regardless of this
setting.
&
charset
If supplied, the connection character set will be changed to this
character set (MySQL-4.1 and newer).
This implies use_unicode=True.
You should also check the encoding of your db tables.
To make a string a Unicode string you should use the stringprefix 'u'. See also here http://docs.python.org/reference/lexical_analysis.html#literals
Maybe your example works by just adding the prefix in the initial assignment.

Categories

Resources