Python encoding issue

I'm working with a Python script that receives a raw byte string in UTF-8 encoding. First I decode it from UTF-8 to a unicode object, then some processing is done, and at the end I encode it back to UTF-8 and insert it into a MySQL database. But the characters in the DB are not stored in their real form.
str = '<term>Beiträge</term>'  # UTF-8 encoded byte string (note: this shadows the built-in str)
str = str.decode('utf8')       # now a unicode object
...
...
...
str = str.encode('utf8')       # back to a UTF-8 byte string
After that, the string appears in its real form when written to a txt file, but in the MySQL DB I find it like this:
<term>"Beiträge</term>
Any idea why this happened? :-(

Assuming you are using the MySQLdb library, you need to create connections using the keyword arguments:
use_unicode
    If True, text-like columns are returned as unicode objects using the
    connection's character set. Otherwise, text-like columns are returned
    as normal strings. Unicode objects will always be encoded to the
    connection's character set regardless of this setting.
and
charset
    If supplied, the connection character set will be changed to this
    character set (MySQL-4.1 and newer). This implies use_unicode=True.
You should also check the encoding of your db tables.
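As a rough sketch of such a connection (the host/user/passwd/db values and the table name here are placeholders, not from the question):
import MySQLdb

db = MySQLdb.connect(host='localhost', user='root', passwd='',
                     db='mydb', charset='utf8', use_unicode=True)
cur = db.cursor()
# With charset='utf8', unicode objects can be passed straight to the driver;
# parameterized queries let MySQLdb handle the encoding.
cur.execute("INSERT INTO terms (body) VALUES (%s)", (u'<term>Beiträge</term>',))
db.commit()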

To make a string literal a Unicode string you should use the string prefix 'u'. See also http://docs.python.org/reference/lexical_analysis.html#literals
Maybe your example works by just adding the prefix in the initial assignment.
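For instance, applied to the initial assignment (a sketch; with a non-ASCII literal, Python 2 also needs a coding declaration such as # -*- coding: utf-8 -*- at the top of the file):
s = u'<term>Beiträge</term>'  # already a unicode object; no .decode('utf8') step needed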

Related

String encode/decode issue - missing character from end

I have an NVARCHAR column in my database. I am unable to convert the content of this column to a plain string in my code. (I am using pyodbc for the database connection.)
# This unicode string is returned by the database
>>> my_string = u'\u4157\u4347\u6e65\u6574\u2d72\u3430\u3931\u3530\u3731\u3539\u3533\u3631\u3630\u3530\u3330\u322d\u3130\u3036\u3036\u3135\u3432\u3538\u2d37\u3134\u3039\u352d'
# prints something in Chinese
>>> print my_string
䅗䍇湥整⵲㐰㤱㔰㜱㔹㔳㘱㘰㔰㌰㈭㄰〶〶ㄵ㐲㔸ⴷㄴ〹㔭
The closest I have gotten is by encoding it to UTF-16:
>>> my_string.encode('utf-16')
'\xff\xfeWAGCenter-04190517953516060503-20160605124857-4190-5'
>>> print my_string.encode('utf-16')
��WAGCenter-04190517953516060503-20160605124857-4190-5
But the actual value that I need, as per the value stored in the database, is:
WAGCenter-04190517953516060503-20160605124857-4190-51
I tried encoding it to utf-8, utf-16, ascii, and utf-32, but nothing seemed to work.
Does anyone have an idea what I am missing, and how to get the desired result from my_string?
Edit: On converting it to utf-16-le, I am able to remove the unwanted characters from the start, but one character is still missing from the end:
>>> print my_string.encode('utf-16-le')
WAGCenter-04190517953516060503-20160605124857-4190-5
When I try some other columns, it works. What might be the cause of this intermittent issue?
You have a major problem in your database definition, in the way you store values in it, or in the way you read values from it. I can only explain what you are seeing; I cannot say why it happens or how to fix it without knowing:
the type of the database
the way you input values in it
the way you extract values to obtain your pseudo unicode string
the actual content if you use direct (native) database access
What you get is an ASCII string whose 8-bit characters are grouped in pairs to build 16-bit unicode characters in little-endian order. As the expected string has an odd number of characters, the last character was (irremediably) lost in translation, because the original string ends with u'\u352d', where 0x2d is the ASCII code for '-' and 0x35 for '5'. Demo:
def cvt(ustring):
    # Rebuild the byte string: each 16-bit code point packs two
    # 8-bit characters, low-order byte first (little endian).
    l = []
    for uc in ustring:
        l.append(chr(ord(uc) & 0xFF))         # low-order byte
        l.append(chr((ord(uc) >> 8) & 0xFF))  # high-order byte
    return ''.join(l)

>>> cvt(my_string)
'WAGCenter-04190517953516060503-20160605124857-4190-5'
The issue was that I was using UTF-16 in my odbcinst.ini file, whereas I had to use the UTF-8 character encoding.
Earlier I was changing it as an OPTION parameter while making the connection with pyodbc, but changing it in the odbcinst.ini file instead fixed the issue.

Are null bytes allowed in unicode strings in PostgreSQL via Python?

Are null bytes allowed in unicode strings?
I'm not asking about UTF-8; I mean the high-level object representation of a unicode string.
Background
We store unicode strings containing null bytes via Python in PostgreSQL.
The strings are cut at the null byte when we read them back.
On the database side, PostgreSQL itself does not allow the null byte ('\0') in char/text/varchar fields, so if you try to store a string containing it you receive an error. Example:
postgres=# SELECT convert_from('foo\000bar'::bytea, 'unicode');
ERROR: 22021: invalid byte sequence for encoding "UTF8": 0x00
If you really need to store such information, you can use the bytea data type on the PostgreSQL side. Make sure to encode it correctly.
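As a sketch of what that can look like with psycopg2 (the connection string, table, and column names are made up for illustration):
import psycopg2

conn = psycopg2.connect('dbname=test')  # placeholder connection string
cur = conn.cursor()
data = 'foo\x00bar'  # byte string containing a null byte
# psycopg2.Binary wraps the bytes for a bytea column, so the null byte survives.
cur.execute("INSERT INTO blobs (payload) VALUES (%s)", (psycopg2.Binary(data),))
conn.commit()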
Python itself is perfectly capable of having both byte strings and Unicode strings containing null characters (value zero). However, if you call out to a library implemented in C, that library may use the C convention of stopping at the first null character.
Since a C string is basically just data and a pointer, you can store a null in it. However, since null conventionally marks the end of the string (the "null terminator"), there is no way to read beyond the null without knowing the size ahead of reading.
Therefore, it seems you ought to store your data as binary and read it as a buffer.
Good luck!

support of utf-8 when dealing with mysqldb

When I insert a query into MySQL, I do the following:
execute(query.encode('utf8'))
That is, I first encode the unicode string to UTF-8, and then execute.
Now, when I retrieve the data from SQL back into my Python code, do I need to convert it back to unicode again?
If yes: SQL returns a block of rows, so how do I convert them all?
I get rows using cursor.fetchall(). How do I make each element of each row unicode?
Step one: read and understand this: http://www.joelonsoftware.com/articles/Unicode.html #IMPORTANT
Now you know that all text data in your program should be unicode (and that you should no longer mistake unicode data for a UTF-8 encoded string, or "encode" for "decode").
You can specify the encoding to be used when creating your MySQL connection: see http://mysql-python.sourceforge.net/MySQLdb-1.2.2/ and check the "charset" parameter in the __init__ method of the "connections" module; it can be passed as a parameter to the MySQLdb.connect call.
From there on, all text data retrieved from this connection will be presented to Python as unicode strings, and unicode strings you pass in will be encoded with that character set when stored.
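A minimal sketch of the read path under that setup (the connection parameters, table, and column names are placeholders):
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='root', passwd='',
                       db='mydb', charset='utf8')  # charset implies use_unicode=True
cur = conn.cursor()
cur.execute("SELECT name FROM items")
for (name,) in cur.fetchall():
    print type(name), name  # <type 'unicode'>; no manual decoding needed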

strange python regex behavior - maybe connected to unicode or sqlalchemy

I'm trying to search for a pattern in SQLAlchemy results (actually filtering with a 'like' or .op('regexp')(pattern), which I believe is implemented with a regex somewhere). The string and the search string are both in Hebrew, and presumably (maybe I'm wrong) unicode,
where r = u'לבן' and c = u'לבן, ורוד, '.
When I do re.search(r, c) I get the SRE.match object,
but when I query the DB like:
f = session.query(classname)
c = f[0].color
then c gives me:
'\xd7\x9c\xd7\x91\xd7\x9f,\xd7\x95\xd7\xa8\xd7\x95\xd7\x93,'
and print(c) gives:
לבן,ורוד,
which is practically the same, but running re.search(r, c) gives no match object.
Since I suspected a unicode issue, I tried to convert it with unicode(c) and got "UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 0: ordinal not in range(128)", which I guess means this is already a unicode string. So where's the catch here?
I would prefer using the SQLAlchemy 'like', but I get the same problem there, even though I know for sure (as I showed in my example) that the data contains the string.
Should I transform the search string or pattern somehow? Is this related to unicode? Something else?
The db table (which I'm quering) collation is utf8_unicode_ci
c = f[0].color
is not returning a Unicode string (otherwise its repr() would show a u'...' kind of string), but a UTF-8 encoded byte string.
Try
c = f[0].color.decode("utf-8")
which results in
u'\u05dc\u05d1\u05df,\u05d5\u05e8\u05d5\u05d3,'
or
u'לבן,ורוד,'
if your console can display Hebrew characters.
'\xd7\x9c\xd7\x91\xd7\x9f,\xd7\x95\xd7\xa8\xd7\x95\xd7\x93,' is the UTF-8 encoded representation of the string u'לבן, ורוד, '. So in the second example you should write re.search(r, c.decode('utf-8')).
With unicode(c) you're trying to do almost the same thing, except without the encoding parameter, which makes Python fall back to the default ASCII codec; hence the UnicodeDecodeError.
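Putting it together (a sketch; the coding declaration is needed for the Hebrew literal in a Python 2 source file):
# -*- coding: utf-8 -*-
import re

r = u'לבן'  # unicode pattern
c = '\xd7\x9c\xd7\x91\xd7\x9f,\xd7\x95\xd7\xa8\xd7\x95\xd7\x93,'  # UTF-8 bytes from the DB
m = re.search(r, c.decode('utf-8'))  # decode first, so both sides are unicode
print m.group(0) if m else 'no match'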

UnicodeEncodeError: 'latin-1' codec can't encode character

What could be causing this error when I try to insert a foreign character into the database?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201c' in position 0: ordinal not in range(256)
And how do I resolve it?
Thanks!
I ran into this same issue when using the Python MySQLdb module. Since MySQL will let you store just about any binary data you want in a text field regardless of character set, I found my solution here:
Using UTF8 with Python MySQLdb
Edit: Quote from the above URL to satisfy the request in the first comment...
"UnicodeEncodeError:'latin-1' codec can't encode character ..."
This is because MySQLdb normally tries to encode everythin to latin-1.
This can be fixed by executing the following commands right after
you've etablished the connection:
db.set_character_set('utf8')
dbc.execute('SET NAMES utf8;')
dbc.execute('SET CHARACTER SET utf8;')
dbc.execute('SET character_set_connection=utf8;')
"db" is the result of MySQLdb.connect(), and "dbc" is the result of
db.cursor().
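Putting those lines in context (a minimal sketch; the connection parameters are placeholders):
import MySQLdb

db = MySQLdb.connect(host='localhost', user='root', passwd='', db='mydb')
db.set_character_set('utf8')
dbc = db.cursor()
dbc.execute('SET NAMES utf8;')
dbc.execute('SET CHARACTER SET utf8;')
dbc.execute('SET character_set_connection=utf8;')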
Character U+201C Left Double Quotation Mark is not present in the Latin-1 (ISO-8859-1) encoding.
It is present in code page 1252 (Western European). This is a Windows-specific encoding that is based on ISO-8859-1 but which puts extra characters into the range 0x80-0x9F. Code page 1252 is often confused with ISO-8859-1, and it's an annoying but now-standard web browser behaviour that if you serve your pages as ISO-8859-1, the browser will treat them as cp1252 instead. However, they really are two distinct encodings:
>>> u'He said \u201CHello\u201D'.encode('iso-8859-1')
UnicodeEncodeError
>>> u'He said \u201CHello\u201D'.encode('cp1252')
'He said \x93Hello\x94'
If you are using your database only as a byte store, you can use cp1252 to encode “ and other characters present in the Windows Western code page. But still other Unicode characters which are not present in cp1252 will cause errors.
You can use encode(..., 'ignore') to suppress the errors by getting rid of the characters, but really in this century you should be using UTF-8 in both your database and your pages. This encoding allows any character to be used. You should also ideally tell MySQL you are using UTF-8 strings (by setting the database connection and the collation on string columns), so it can get case-insensitive comparison and sorting right.
The best solution is to:
set MySQL's charset to 'utf-8'
connect as this comment suggests (add use_unicode=True and charset="utf8"):
db = MySQLdb.connect(host="localhost", user="root", passwd="", db="testdb", use_unicode=True, charset="utf8")
For details, see the docstring of MySQLdb's Connection class:
class Connection(_mysql.connection):
    """MySQL Database Connection Object"""

    default_cursor = cursors.Cursor

    def __init__(self, *args, **kwargs):
        """
        Create a connection to the database. It is strongly recommended
        that you only use keyword parameters. Consult the MySQL C API
        documentation for more information.

        host
            string, host to connect
        user
            string, user to connect as
        passwd
            string, password to use
        db
            string, database to use
        port
            integer, TCP/IP port to connect to
        unix_socket
            string, location of unix_socket to use
        conv
            conversion dictionary, see MySQLdb.converters
        connect_timeout
            number of seconds to wait before the connection attempt fails.
        compress
            if set, compression is enabled
        named_pipe
            if set, a named pipe is used to connect (Windows only)
        init_command
            command which is run once the connection is created
        read_default_file
            file from which default client values are read
        read_default_group
            configuration group to use from the default file
        cursorclass
            class object, used to create cursors (keyword only)
        use_unicode
            If True, text-like columns are returned as unicode objects
            using the connection's character set. Otherwise, text-like
            columns are returned as normal strings. Unicode objects will
            always be encoded to the connection's character set
            regardless of this setting.
        charset
            If supplied, the connection character set will be changed
            to this character set (MySQL-4.1 and newer). This implies
            use_unicode=True.
        sql_mode
            If supplied, the session SQL mode will be changed to this
            setting (MySQL-4.1 and newer). For more details and legal
            values, see the MySQL documentation.
        client_flag
            integer, flags to use or 0
            (see MySQL docs or constants/CLIENTS.py)
        ssl
            dictionary or mapping, contains SSL connection parameters;
            see the MySQL documentation for more details
            (mysql_ssl_set()). If this is set, and the client does not
            support SSL, NotSupportedError will be raised.
        local_infile
            integer, non-zero enables LOAD LOCAL INFILE; zero disables
        autocommit
            If False (default), autocommit is disabled.
            If True, autocommit is enabled.
            If None, autocommit isn't set and the server default is used.

        There are a number of undocumented, non-standard methods. See the
        documentation for the MySQL C API for some hints on what they do.
        """
I hope your database uses UTF-8. If so, you will need to run yourstring.encode('utf-8') before putting the string into the database.
Use the snippet below to strip accents from Latin characters, leaving their plain ASCII equivalents:
import unicodedata

def strip_accents(text):
    # Decompose each character (NFKD), then drop the combining marks ('Mn').
    return "".join(char for char in unicodedata.normalize('NFKD', text)
                   if unicodedata.category(char) != 'Mn')

>>> strip_accents(u'áéíñóúü')
'aeinouu'
You are trying to store the Unicode codepoint \u201c using the ISO-8859-1 / Latin-1 encoding, which cannot represent that codepoint. Either alter the database to use UTF-8 and store the string data with an appropriate encoding, or sanitise your inputs before storing the content, e.g. by following something like Sam Ruby's excellent i18n guide, which discusses the issues windows-1252 can cause, suggests how to process them, and links to sample code.
SQLAlchemy users can simply specify their field with convert_unicode=True.
Example:
sqlalchemy.String(1000, convert_unicode=True)
SQLAlchemy will simply accept unicode objects and return them back, handling the encoding itself.
Docs
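In a table definition that might look like this (a sketch for older SQLAlchemy releases, where convert_unicode is available; the table and column names are made up):
from sqlalchemy import Column, Integer, MetaData, String, Table

metadata = MetaData()
items = Table('items', metadata,
              Column('id', Integer, primary_key=True),
              Column('name', String(1000, convert_unicode=True)))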
Latin-1 (aka ISO 8859-1) is a single octet character encoding scheme, and you can't fit \u201c (“) into a byte.
Did you mean to use UTF-8 encoding?
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 106: ordinal not in range(256)
Solution 1:
\u2013: google the character to identify what is actually causing this error. Then you can replace that specific character in the string with some other character that is part of the encoding you are using.
Solution 2:
Change the string's encoding to one that includes all the characters of your string; then you can print that string and it will work just fine.
The code below, borrowed from bobince's answer above, changes the encoding of the string:
u'He said \u201CHello\u201D'.encode('cp1252')
The latest version of mysql.connector has only
db.set_charset_collation('utf8', 'utf8_general_ci')
and NOT
db.set_character_set('utf8')  # this method is not available
I ran into the same problem when I was using PyMySQL. I checked the package version: it was 0.7.9.
Then I uninstalled it and reinstalled PyMySQL 1.0.2, which solved the issue:
pip uninstall PyMySQL
pip install PyMySQL
Python: you will need to add
# -*- coding: UTF-8 -*-
as the first line of the Python file, and then call .encode('ascii', 'xmlcharrefreplace') on the text. This will replace all the non-ASCII characters with their XML character reference equivalents.
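A quick demonstration of what xmlcharrefreplace produces:
>>> u'He said \u201CHello\u201D'.encode('ascii', 'xmlcharrefreplace')
'He said &#8220;Hello&#8221;'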
