Python 2.7 - MySQL utf8 encoding - python

When calling MySQL from Python I prepare the connection with "SET NAMES 'utf8'", but still something is not right. I get a sequence like this:
å½å®¶1级è¯ä¹¦
when I am supposed to get Chinese characters, which display fine as utf8 everywhere else.
When I look at the utf8 byte sequence it clearly doesn't match the real one: same sort of format, but different numbers.
Is this erroneous encoding on Python 2.7's end or bad programming on my end? I know Python 3.x has solved these issues but I cannot use the modules I want in later versions.
I know Python 2.7 can actually display Chinese by using the print statement, but the string is otherwise stored and shown as encoded bytes. Look:
>>> '你好'
'\xc4\xe3\xba\xc3'
>>> print '\xc4\xe3\xba\xc3'
你好

OK... It seems adding
"SET NAMES 'gbk'"
before the MySQL SELECT query did the trick. Now at least the strings from my dictionary and from the SQL database can be compared. It also seems that gbk is often the preferred character set in China.
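For reference, here is a minimal sketch of how the same thing might look with the MySQLdb (MySQL-python) driver; the connection credentials and the table/column names are placeholders, and passing charset to connect() has the driver set the connection character set for you, much like SET NAMES:

import MySQLdb

# charset='gbk' matches the data above; use_unicode=True makes the
# driver hand back unicode objects instead of raw byte strings.
conn = MySQLdb.connect(host='localhost', user='user', passwd='secret',
                       db='mydb', charset='gbk', use_unicode=True)
cur = conn.cursor()
cur.execute("SELECT name FROM people")  # hypothetical table/column
for (name,) in cur.fetchall():
    print name  # already-decoded unicode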

Related

Python MySQL CSV export to json strange encoding

I received a csv file exported from a MySQL database (I think the encoding is latin1, since the language is Spanish). Unfortunately the encoding is wrong and I cannot process it at all. If I use file:
$ file -I file.csv
file.csv: text/plain; charset=unknown-8bit
I have tried to read the file in python and convert it to utf-8 like:
r.decode('latin-1').encode("utf-8")
or using mysql_latin1_codec:
r.decode('mysql_latin1').encode('UTF-8')
I am trying to transform the data into json objects. The error comes when I save the file:
'UnicodeEncodeError: 'ascii' codec can't encode characters in position'
Do you know how I can convert it to normal utf-8 characters? Or how I can convert the data to valid json? Thanks!!
I got really good results by using a pandas DataFrame (e.g. from the Continuum Analytics distribution).
You could do something like:
import pandas as pd
import MySQLdb

# the connection credentials are placeholders
con = MySQLdb.connect(host='localhost', user='user', passwd='secret', db='mydb')
data = pd.read_sql_query('SELECT * FROM YOUR_TABLE', con)
Then you could do:
data.to_csv('path_with_file_name')
or to convert to JSON:
data.to_json(orient='records')
or if you prefer to customize your json format see the documentation here:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html
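As a quick illustration of the JSON output (my example data, not the asker's), orient='records' yields one object per row, with non-ASCII characters escaped by default:

import pandas as pd

df = pd.DataFrame({'city': [u'S\xe3o Paulo', u'M\xe1laga']})
print(df.to_json(orient='records'))
# [{"city":"S\u00e3o Paulo"},{"city":"M\u00e1laga"}]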
Have you tried using the codecs module?:
import codecs
....
codecs.EncodedFile(r, 'latin1').reader.read()
I remember having a similar issue a while back, and the answer had something to do with how encoding was done prior to Python 3. codecs seems to handle this problem relatively elegantly.
As coder mentioned in the question comments, it's difficult to pinpoint the problem without being able to reproduce it, so I may be barking up the wrong tree.
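For completeness, here is a self-contained sketch of the whole conversion with codecs, under the assumption that the file really is Latin-1 (the filenames are placeholders):

import codecs

with codecs.open('file.csv', 'r', encoding='latin-1') as src:
    text = src.read()  # decoded to unicode on the way in

with codecs.open('file-utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(text)  # re-encoded as UTF-8 on the way out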
You probably have two problems. But let's back off... We can't tell whether the text was imported incorrectly, exported incorrectly, or merely displayed in a goofy way.
First, I am going to discuss "importing"...
Do not try to alter the encoding. Instead live with the encoding. But first, figure out what the encoding is. It could be latin1 or it could be utf8. (Or any of lots of less likely charsets.)
Find out the hex for the incoming file. In Python 2, the code for dumping the hex (etc.) of a unicode string u is something like this:
import unicodedata

for i, c in enumerate(u):
    print i, '%04x' % ord(c), unicodedata.category(c),
    print unicodedata.name(c)
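As a sanity check (my illustration, not from the original answer), running that loop over u = u'\xf3' (a lone ó) prints:

0 00f3 Ll LATIN SMALL LETTER O WITH ACUTE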
You can go here to see a list of hex values for all the latin1 characters, together with the utf8 hex. For example, ó is latin1 F3 or utf8 C3B3.
Now, armed with knowing the encoding, tell MySQL that.
LOAD DATA INFILE ...
...
CHARACTER SET utf8 -- or latin1
...;
Meanwhile, it does not matter what CHARACTER SET ... the table or column is defined to be; MySQL will transcode if necessary. All Spanish characters are available in both latin1 and utf8.
Go to this Q&A.
I suggested that you have two errors: one is the "black diamond" case mentioned there; the other is something else. But... follow the "Best Practice" mentioned there.
Back to your question about "exporting"...
Again, you need to check the hex of the output file. Again, it does not matter whether it is latin1 or utf8. However... if the hex is C383C2B3 for a simple ó, you have "double encoding". If you have that, check that you have removed any manual conversion function calls and have simply told MySQL what's what.
Here are some more utf8+Python tips you might need.
If you need more help, follow the text step-by-step. Show us the code used to move/convert it at each step, and show us the HEX at each step.

What is the function equivalent of prepending the 'b' character to a string literal in Python 2?

What function can I apply to a string variable that will cause the same result as prepending the b modifier to a string literal?
I've read in this question about the b modifier for string literals in Python 2 that prepending b to a string makes it a byte string (mainly for compatibility between Python 2 and Python 3 when using 2to3). The result I would like to obtain is the same, but applied to a variable, like so:
def is_binary_string_equal(string_variable):
    binary_string = b'this is binary'
    return convert_to_binary(string_variable) == binary_string
>>> is_binary_string_equal('this is binary')
True
What is the correct definition of convert_to_binary?
First, note that in Python 2.x, the b prefix actually does nothing. b'foo' and 'foo' are both exactly the same string literal. The b only exists to allow you to write code that's compatible with both Python 2.x and Python 3.x: you can use b'foo' to mean "I want bytes in both versions", and u'foo' to mean "I want Unicode in both versions", and just plain 'foo' to mean "I want the default str type in both versions, even though that's Unicode in 3.x and bytes in 2.x".
So, "the functional equivalent of prepending the 'b' character to a string literal in Python 2" is literally doing nothing at all.
But let's assume that you actually have a Unicode string (like what you get out of a plain literal or a text file in Python 3, even though in Python 2 you can only get these by explicitly decoding, or using some function that does it for you, like opening a file with codecs.open). Because then it's an interesting question.
The short answer is: string_variable.encode(encoding).
But before you can do that, you need to know what encoding you want. You don't need that with a literal string, because when you use the b prefix in your source code, Python knows what encoding you want: the same encoding as your source code file.* But everything other than your source code—files you open and read, input the user types, messages coming in over a socket—could be anything, and Python has no idea; you have to tell it.**
In many cases (especially if you're on a reasonably recent non-Windows machine and dealing with local data), it's safe to assume that the answer is UTF-8, so you can spell convert_to_binary_string(string_variable) as string_variable.encode('utf8'). But "many" isn't "all".*** This is why text editors and web browsers let the user select an encoding—because sometimes only the user actually knows.
* See PEP 263 for how you can specify the encoding, and why you'd want to.
** You can also use bytes(s, encoding), which is a synonym for s.encode(encoding). And, in both cases, you can leave off the encoding argument—but then it defaults to something which is more likely to be ASCII than what you actually wanted, so don't do that.
*** For example, many older network protocols are defined as Latin-1. Many Windows text files are created in whatever the OEM charset is set to—usually cp1252 on American systems, but there are hundreds of other possibilities. Sometimes sys.getdefaultencoding() or locale.getpreferredencoding() gets what you want, but that obviously doesn't work when, say, you're processing a file that someone uploaded that's in his machine's preferred encoding, not yours.
In the special case where the relevant encoding is "whatever this particular source file is in", you pretty much have to know that somehow out-of-band.* Once a script or module has been compiled and loaded, it's no longer possible to tell what encoding it was originally in.**
But there shouldn't be much reason to want that. After all, if two binary strings are equal, and in the same encoding, the Unicode strings are also equal, and vice-versa, so you could just write your code as:
def is_binary_string_equal(string_variable):
    binary_string = u'this is binary'
    return string_variable == binary_string
* The default is, of course, documented—it's UTF-8 for 3.0, ASCII or Latin-1 for 2.x depending on your version. But you can override that, as PEP 263 explains.
** Well, you could use the inspect module to find the source, then the importlib module to start processing it, etc.—but that only works if the file is still there and hasn't been edited since you last compiled it.
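To make that concrete, a small Python 2 interpreter session (my illustration; the strings are arbitrary):

>>> u = u'caf\xe9'        # u'café'
>>> u.encode('utf8')
'caf\xc3\xa9'
>>> u.encode('latin-1')
'caf\xe9'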
Note that in Python 3.7 (executed on a Linux machine) it is not the same to use .encode('UTF-8') and b'string'.
It caused a lot of pain in a project of mine, and to this day I have no clear understanding of why it happens, but doing this in Python 3.7:
print('\xAD\x43\x48\x49\x44\x44\x49\x4E\x47\x53\x54\x4F\x4E\x45'.encode('UTF-8'))
print(b'\xAD\x43\x48\x49\x44\x44\x49\x4E\x47\x53\x54\x4F\x4E\x45')
returns this on the console:
b'\xc2\xadCHIDDINGSTONE'
b'\xadCHIDDINGSTONE'
(The reason: in a str literal, \xAD denotes the character U+00AD, a soft hyphen, and UTF-8 encodes it as the two bytes C2 AD; in a bytes literal, \xAD is already the single raw byte 0xAD.)

Curious about unicode / string encoding in Python 3

I'd like to ask why something works which I found after painful hours of reading, trying to understand, and in the end simply successful trial and error...
I'm on Linux (Ubuntu 13.04, German time formats etc., but English system language). My small Python 3 script connects to the sqlite3 database of the reference manager Zotero. There I read a couple of keys with the goal of exporting files from the Zotero storage directory (probably not important and, as said above, I got it working).
All of this works fine with characters in the ASCII set, but of course there are a lot of international authors in the database, and my code used to fail on non-ASCII authors/paper titles.
Perhaps first some info about the database from the sqlite3 command line:
sqlite3 zotero-test.sqlite
SQLite version 3.7.15.2 2013-01-09 11:53:05
sqlite> PRAGMA encoding;
UTF-8
An example of a problematic entry:
sqlite> select * from itemattachments;
317|281|1|application/pdf|5|storage:Müller-Forell_2008_Orbitatumoren.pdf||2|1372357574000|2814ef3ea9c50cce2c32d6fb46b977bb
The correct name would be "storage:Müller-Forell"; Zotero itself decodes this correctly, but SQLite does not (at least it does not output it correctly in my terminal).
Google tells me that "ü" is what you get when a UTF-8-encoded "ü" is decoded as latin-1/ISO 8859-1 (or not decoded at all).
Reading this database entry from Python 3 with
import sqlite3

connection = sqlite3.connect("zotero-test.sqlite")
cursor = connection.cursor()
cursor.execute("SELECT itemattachments.itemID,itemattachments.sourceItemID,itemattachments.path,items.key FROM itemattachments,items WHERE mimetype=\"application/pdf\" AND items.itemID=itemattachments.itemID")
for pdf_result in cursor:
    print(pdf_result[2])
    print()
    print(pdf_result[2].encode("latin-1").decode("utf-8"))
gives:
storage:Müller-Forell_2008_Orbitatumoren.pdf
storage:Müller-Forell_2008_Orbitatumoren.pdf
, the second being correct, so I got my script working (gosh, how many hours this cost me...).
Can somebody explain to me what this construction of .encode and .decode does? Which one is even executed first?
Thanks for any clues,
Joost
The cursor yields str objects. We run encode() on each to convert it to bytes, and then decode those bytes back into a str. It sounds like the data in the database is mis-encoded.
What you're seeing here is UTF8 data encoded in latin-1 stored in the SQLite database.
The sqlite3 module always returns unicode strings, so you first have to encode them as latin-1 (recovering the bytes that were stored) and then decode those bytes as UTF-8.
They shouldn't have been stored in the db as latin-1 to begin with.
You are executing encode before decode.
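If you hit this pattern a lot, you could wrap the round-trip in a small helper; this is a sketch under the assumption that the mis-decoding is always Latin-1 over UTF-8 bytes:

def fix_mojibake(s):
    # Recover the original bytes, then decode them properly.
    try:
        return s.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s  # not the Latin-1/UTF-8 pattern; leave unchanged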

Django: unicode string gets written to database as non-unicode

I have written a basic script that imports several thousand values into a Django database. Here's what it looks like: link.
Those locations are in Cyrillic letters and are represented as unicode literals. However, as soon as I save them to the database, they come back as what seem to be plain byte strings, displayed with hex escapes:
>>> Region.objects.all()[0].parent
'\xd0\xbe\xd0\xb1\xd0\xbb\xd0\xb0\xd1\x81\xd1\x82 \xd0\xa1\xd0\xbb\xd0\xb8\xd0\xb2\xd0\xb5\xd0\xbd'
Surprisingly, they appear correctly in the admin panel, but I have trouble when trying to use them. How do I store and retrieve them as unicode?
I'm running Django 1.4.0 on top of MySQL, collation set to utf8_bin.
This is a Django/MySQL "bug". See issue #16052. It's actually documented here.
It looks like the data is being returned as a UTF-8 byte string rather than a Unicode string. Try decoding it:
>>> x='\xd0\xbe\xd0\xb1\xd0\xbb\xd0\xb0\xd1\x81\xd1\x82 \xd0\xa1\xd0\xbb\xd0\xb8\xd0\xb2\xd0\xb5\xd0\xbd'
>>> x.decode('utf-8')
u'\u043e\u0431\u043b\u0430\u0441\u0442 \u0421\u043b\u0438\u0432\u0435\u043d'
>>> print x.decode('utf-8')
област Сливен
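If you cannot upgrade past the bug, a defensive sketch is to coerce on access (Region and .parent are from the question; the isinstance check is my assumption that some rows may already come back as unicode):

parent = Region.objects.all()[0].parent
if isinstance(parent, str):  # Python 2: a UTF-8 byte string
    parent = parent.decode('utf-8')
print parent  # prints the Cyrillic text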

Python and Unicode: How everything should be Unicode

Forgive me if this is a long question:
I have been programming in Python for around six months. Self-taught, starting with the Python tutorial and then SO and then just using Google for stuff.
Here is the sad part: no one told me all strings should be Unicode. No, I am not lying or making this up, but where does the tutorial mention it? And most examples I see also use byte strings instead of Unicode strings. I was just browsing and came across this question on SO which says that every string in Python should be Unicode. This pretty much made me cry!
I read that every string in Python 3.0 is Unicode by default, so my questions are for 2.x:
1. Should I do print u'Some text' or just print 'Text'?
2. Everything should be Unicode: does this mean, say I have a tuple t = ('First', 'Second'), that it should be t = (u'First', u'Second')?
3. I read that I can do a from __future__ import unicode_literals and then every string literal will be a Unicode string, but should I do this inside a container also?
4. When reading/writing to a file, should I use the codecs module? Or should I just use the standard way of reading/writing and encode or decode where required?
5. If I get a string from, say, raw_input(), should I convert that to Unicode also?
What is the common approach to handling all of the above issues in 2.x? The from __future__ import unicode_literals statement?
Sorry for being such a noob, but this changes what I have been doing for a long time, so clearly I am confused.
The "always use Unicode" suggestion is primarily to make the transition to Python 3 easier. If you have a lot of non-Unicode string access in your code, it'll take more work to port it.
Also, you shouldn't have to decide on a case-by-case basis whether a string should be stored as Unicode or not. You shouldn't have to change the types of your strings and their very syntax just because you changed their contents, either.
It's also easy to use the wrong string type, leading to code that mostly works, or code which works in Linux but not in Windows, or in one locale but not another. For example, for c in "漢字" in a UTF-8 locale will iterate over each UTF-8 byte (all six of them), not over each character; whether that breaks things depends on what you do with them.
In principle, nothing should break if you use Unicode strings, but things may break if you use regular strings when you shouldn't.
In practice, however, it's a pain to use Unicode strings everywhere in Python 2. codecs.open doesn't pick the correct locale automatically; this fails:
codecs.open("blar.txt", "w").write(u"漢字")
The real answer is:
import locale, codecs
lang, encoding = locale.getdefaultlocale()
codecs.open("blar.txt", "w", encoding).write(u"漢字")
... which is cumbersome, forcing people to write helper functions just to open files. codecs.open should use the locale's encoding automatically when one isn't specified; Python's failure to make such a simple operation convenient is one of the reasons people generally don't use Unicode everywhere.
Finally, note that Unicode strings are even more critical in Windows in some cases. For example, if you're in a Western locale and you have a file named "漢字", you must use a Unicode string to access it, eg. os.stat(u"漢字"). It's impossible to access it with a non-Unicode string; it just won't see the file.
So, in principle I'd say the Unicode string recommendation is reasonable, but with the caveat that I don't generally even follow it myself.
No, not every string "should be Unicode". Within your Python code, you know whether a string literal needs to be Unicode or not, so it doesn't make any sense to turn every string literal into a Unicode literal.
But there are cases where you should use Unicode. For example, if you have arbitrary input that is text, use Unicode for it. You will sooner or later find a non-American using it, and he wants to wrîte têxt ås hé is üsed tö. And you'll get problems in that case unless your input and output happen to use the same encoding, which you can't be sure of.
So in short, no, strings shouldn't be Unicode. Text should be. But YMMV.
Specifically:
No need to use Unicode here. You know if that string is ASCII or not.
Depends on whether you need to merge those strings with Unicode or not.
Both ways work. But do not encode/decode "when required": decode ASAP, encode as late as possible. Using codecs works well (or io, from Python 2.7).
Yeah.
IMHO (my simple rules):
Should I do a:
print u'Some text' or just print 'Text' ?
Everything should be Unicode, does this mean, like say I have a tuple:
t = ('First', 'Second'), it should be t = (u'First', u'Second')?
Well, I use unicode literals only when I have some character outside the ASCII range:
print 'New York', u'São Paulo'
t = ('New York', u'São Paulo')
When reading/writing to a file, should I use the codecs module? Or should I just use the standard way of reading/writing and encode or decode where required?
If you expect unicode text, use codecs.
If I get the string from say raw_input(), should I convert that to Unicode also?
Only if you expect unicode text that may get transferred to another system with a distinct default encoding (including databases).
EDITED (about mixing unicode and byte strings):
>>> print 'New York', 'to', u'São Paulo'
New York to São Paulo
>>> print 'New York' + ' to ' + u'São Paulo'
New York to São Paulo
>>> print "Côte d'Azur" + ' to ' + u'São Paulo'
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)
>>> print "Côte d'Azur".decode('utf-8') + ' to ' + u'São Paulo'
Côte d'Azur to São Paulo
So if you mix a byte string that contains utf-8 (or other non-ASCII characters) with unicode text without explicit conversion, you will have trouble, because the default codec assumes ASCII. The other way around seems to be safe. If you follow the rule of writing every string containing non-ASCII as a unicode literal, you should be OK.
DISCLAIMER: I live in Brazil, where people speak Portuguese, a language with lots of non-ASCII characters. My default encoding is always set to 'utf-8'. Your mileage may vary in English/ASCII systems.
I’m just adding my personal opinion here. Not as long and elaborate as the other answers, but maybe it can help, too.
print u'Some text' or just print 'Text' ?
I’d indeed prefer the first. If you know that you only have Unicode strings, you have one more invariant. Various other languages (C, C++, Perl, PHP, Ruby, Lua, …) sometimes encounter painful problems because of their lack of separation between code unit sequences and integer sequences. I find the approach of strict distinction between them that exists in .NET, Java, Python etc. quite a bit cleaner.
Everything should be Unicode, does this mean, like say I have a tuple:
t = ('First', 'Second'), it should be t = (u'First', u'Second')?
Yes.
I read that I can do a from __future__ import unicode_literals and then every string will be a Unicode string, but should I do this inside a container also?
Yes. Future statements apply only to the file where they’re used, so you can use them without interfering with other modules. I generally import all futures in Python 2.x modules to make the transition to 3.x easier.
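For example, the top of such a module would typically read (a common pattern, not something the futures machinery requires):

from __future__ import absolute_import, division, print_function, unicode_literals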
When reading/writing to a file, should I use the codecs module? Or should I just use the standard way of reading/writing and encode or decode where required?
You should use the codecs module because that makes it impossible (or at least very hard) to accidentally write differently-encoded representations to a single file. It is also the way Python 3.x works when you open a file in text mode.
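The io module (available since Python 2.6) gives the Python 3 behaviour directly; a short sketch with a placeholder filename:

import io

with io.open('out.txt', 'w', encoding='utf-8') as f:
    f.write(u'S\xe3o Paulo\n')  # text mode requires unicode, as in Python 3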
If I get the string from say raw_input(), should I convert that to Unicode also?
I’d say yes to this too: In most cases it’s easier to deal with only one encoding, so I recommend converting to Python Unicode strings as early as possible.
What is the common approach to handling all of the above issues in 2.x? The from __future__ import unicode_literals statement?
I don’t know what the common approach is, but I use that statement all the time. I have encountered only very few issues with this approach, and most of them are related to bugs in external libraries (e.g., NumPy sometimes requires byte strings without documenting that).
The fact that you were writing Python code for six months before encountering anything about Unicode means that the Python 2.x ASCII default for strings didn't cause you any problems. Certainly for a beginner, trying to grasp the ideas of Unicode, code points, and encodings is a hard task in itself; therefore, most tutorials naturally bypass the subject until you get more of a grounding in the fundamentals. That's why in a book like Dive Into Python, Unicode is only mentioned in later chapters.
If you need to support Unicode in your application, I suggest looking at Kumar McMillan's PyCon 2008 talk on Unicode for a list of best practices. It should answer your remaining questions.
1/2) Personally, I've never heard of "always use unicode". That seems pretty stupid to me. I guess I understand it if you plan to support other languages that need Unicode support. But other than that I wouldn't do it; it seems like more of a pain than it's worth.
3) I would just read/write the standard way and encode when necessary.
