I have written a basic script that imports several thousand values into a Django database. Here's what it looks like: link.
Those locations are in Cyrillic letters and are represented as unicode literals. However, as soon as I save them to the database, they come back as what seem to be plain byte strings, shown in some sort of hex encoding:
>>> Region.objects.all()[0].parent
'\xd0\xbe\xd0\xb1\xd0\xbb\xd0\xb0\xd1\x81\xd1\x82 \xd0\xa1\xd0\xbb\xd0\xb8\xd0\xb2\xd0\xb5\xd0\xbd'
Surprisingly, they appear correctly in the admin panel, but I have trouble when trying to use them. How do I store and retrieve them as unicode?
I'm running Django 1.4.0 on top of MySQL, collation set to utf8_bin.
This is a Django/MySQL "bug". See issue #16052. It's actually documented here.
It looks like the data is being returned as a UTF-8 byte string rather than a Unicode string. Try decoding it:
>>> x='\xd0\xbe\xd0\xb1\xd0\xbb\xd0\xb0\xd1\x81\xd1\x82 \xd0\xa1\xd0\xbb\xd0\xb8\xd0\xb2\xd0\xb5\xd0\xbd'
>>> x.decode('utf-8')
u'\u043e\u0431\u043b\u0430\u0441\u0442 \u0421\u043b\u0438\u0432\u0435\u043d'
>>> print x.decode('utf-8')
област Сливен
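If you would rather not sprinkle .decode() calls around your code, one option is Django's own helper, which does the same UTF-8 decoding. A minimal sketch, assuming Django 1.4 on Python 2 and reusing the Region model from the question:
from django.utils.encoding import smart_unicode

region = Region.objects.all()[0]
parent = smart_unicode(region.parent)   # decodes the UTF-8 byte string to unicode
print parent                            # област Сливен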
Related
I want to use an API from a game and store the player and clan names in a local database. The names can contain all sorts of characters and emoticons. Here are just a few examples I found:
⭐💎
яαℓαηι
نکل
窝猫
鐵擊道遊隊
❤✖❤♠️♦️♣️✖
I use Python to read the API and write the data into a MySQL database. After that, I want to use the names in a Node.js web application.
What is the best way to encode those characters, and how can I safely store them in the database so that I can display them correctly afterwards?
I tried to encode the strings in python with utf-8:
>>> sample = '蛙喜鄉民CLUB'
>>> sample
'蛙喜鄉民CLUB'
>>> sample = sample.encode('UTF-8')
>>> sample
b'\xe8\x9b\x99\xe5\x96\x9c\xe9\x84\x89\xe6\xb0\x91CLUB'
and to store the encoded string in a MySQL database that uses the utf8mb4_unicode_ci collation.
When I store the string from above and select it inside MySQL Workbench, it is displayed like this:
蛙喜鄉民CLUB
When I read this string from the database again in Python (and store it in db_str), I get:
>>> db_str
èåéæ°CLUB
>>> db_str.encode('UTF-8')
b'\xc3\xa8\xc2\x9b\xc2\x99\xc3\xa5\xc2\x96\xc2\x9c\xc3\xa9\xc2\x84\xc2\x89\xc3\xa6\xc2\xb0\xc2\x91CLUB'
The first output is total gibberish; the second one, after encoding with utf-8, looks mostly like the encoded string from above, but with \xc2 or \xc3 bytes added in between.
How can I save such strings into MySQL so that I can read them again and display them correctly inside a Python script?
Is my database collation utf8mb4_unicode_ci not suitable for such content? Or do I have to use another encoding?
As described by @abarnert in a comment on the question, the problem was that the library used for writing the unicode strings didn't know that UTF-8 should be used, and therefore encoded the strings incorrectly.
After adding charset='utf8mb4' as a parameter to the MySQL connection, the strings get written correctly in the intended encoding.
All I had to change was
conn = MySQLdb.connect(host, user, passwd, db, port)
to
conn = MySQLdb.connect(host, user, passwd, db, port, charset='utf8mb4')
and after that my approach described in the question worked flawlessly.
Edit: after declaring the charset='utf8mb4' parameter on the connection object, it is no longer necessary to encode the strings manually, as that is now done by the mysqlclient library.
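For reference, a minimal sketch of the full round trip under that setup, assuming Python 3 with the mysqlclient driver; the players table and its columns are made-up placeholders, and the table itself must also use the utf8mb4 character set:
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                       db="game", charset="utf8mb4")  # utf8mb4 covers emoji and CJK
cur = conn.cursor()
cur.execute("INSERT INTO players (name) VALUES (%s)", ("蛙喜鄉民CLUB",))
conn.commit()
cur.execute("SELECT name FROM players")
print(cur.fetchone()[0])   # 蛙喜鄉民CLUB, returned as a str with no manual decoding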
When calling MySQL from Python, I prepare the connection with "SET NAMES 'utf8'", but something is still not right. I get a sequence like this:
å½å®¶1级è¯ä¹¦
when I am supposed to get Chinese characters, which are handled fine by utf8 everywhere else.
When I look at the utf8 byte sequence, it clearly doesn't match the real one: the same sort of format, but different numbers.
Is this erroneous encoding on Python 2.7's end or bad programming on my end? I know Python 3.x has solved these issues but I cannot use the modules I want in later versions.
I know Python 2.7 can actually display Chinese by using the print statement, but otherwise the string is stored and viewed as encoded bytes. Look:
>>> '你好'
'\xc4\xe3\xba\xc3'
>>> print '\xc4\xe3\xba\xc3'
你好
OK. It seems that adding
"SET NAMES 'gbk'"
before the MySQL SELECT query did the trick. Now at least the strings from my dictionary and from the SQL database can be compared. It also seems that gbk is often the preferred character encoding in China.
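For what it's worth, passing the character set on the connection does the same job as issuing SET NAMES by hand. A minimal sketch, assuming Python 2.7 with MySQLdb; the table and column names are placeholders:
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                       db="mydb", charset="gbk", use_unicode=True)
cur = conn.cursor()
cur.execute("SELECT name FROM people WHERE id = %s", (1,))
name = cur.fetchone()[0]   # a unicode object, e.g. u'\u4f60\u597d'
print name                 # the console encoding takes care of display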
I need to insert a series of names (like 'Alam\xc3\xa9') into a list, and then I have to save them into a SQLite database.
I know that I can render these names correctly by typing:
print eval(repr(NAME)).decode("utf-8")
But I have to insert them into a list, so I can't use print.
Is there another way to do this without print?
Lots and lots of misconceptions here.
The string you quote is not Unicode. It is a byte string, encoded in UTF-8.
You can convert it to Unicode by decoding it:
unicode_name = name.decode('utf-8')
When you print the value of unicode_name to the console, you will see one of two things:
>>> unicode_name
u'Alam\xe9'
>>> print unicode_name
Alamé
Here, you can see that just typing the name and pressing Enter shows a representation of the Unicode code points. This is the same as typing print repr(unicode_name). However, doing print unicode_name prints the actual characters; that is, behind the scenes it encodes the string to the correct encoding for your terminal and prints the result.
But this is all irrelevant, because Unicode strings can only be represented internally. As soon as you want to store one in a database, or a file, or anywhere else, you need to encode it. And the most likely encoding to choose is UTF-8, which is the encoding it was in originally.
>>> name
'Alam\xc3\xa9'
>>> print name
Alamé
As you can see, using the original non-decoded version of the name, repr and print once again show the codes and the characters. So converting it to Unicode doesn't make it any more "really" the correct characters.
So, what to do if you want to store it in a database? Nothing. Nothing at all. Sqlite accepts UTF-8 input, and stores its data in UTF-8 format on the disk. So there is absolutely no conversion needed to store the original value of name in the database.
Are you looking for something like this?
[n.decode("utf-8") for n in ['Alam\xc3\xa9', 'Alam\xc3\xa9', 'Alam\xc3\xa9']]
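If the decoded names then have to go into the SQLite database mentioned in the question, a minimal sketch, assuming Python 2 and a made-up table name:
import sqlite3

names = [n.decode("utf-8") for n in ['Alam\xc3\xa9', 'Alam\xc3\xa9', 'Alam\xc3\xa9']]
conn = sqlite3.connect("names.db")
conn.execute("CREATE TABLE IF NOT EXISTS names (name TEXT)")
conn.executemany("INSERT INTO names (name) VALUES (?)", [(n,) for n in names])
conn.commit()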
I am storing some HTML in an SQLite3 database in Python.
When I go to insert some HTML into my SQL table, I get an error, and I don't understand what's wrong or, more importantly, how to fix it.
Error string:
Exception General: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
The HTML string I am inserting into the table is pretty long (about 700 characters long).
Any idea what's wrong and how I can fix this?
Looking at the answer to this question, it looks like your issue is that you are attempting to insert HTML containing characters that do not map to ASCII. If you call unicode(my_problematic_html) you'll probably wind up with a UnicodeDecodeError. In that case you'll want to decode your problematic string representation to unicode by calling:
my_unicoded_html = my_problematic_html.decode("utf-8")
and then writing my_unicoded_html to the database.
You'll want to read Unicode In Python Completely Demystified.
* Please note that your HTML may be encoded in some codec (format? ... charset?) other than utf-8. latin-1 is also a good guess if you are on Windows (or if the HTML might have come from a Windows machine).
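Putting that together, a minimal sketch, assuming Python 2 with the standard sqlite3 module; my_problematic_html comes from above, and the database and table names are placeholders. It decodes with utf-8 first, falls back to latin-1, and binds the unicode value:
import sqlite3

try:
    my_unicoded_html = my_problematic_html.decode("utf-8")
except UnicodeDecodeError:
    my_unicoded_html = my_problematic_html.decode("latin-1")   # latin-1 never fails

conn = sqlite3.connect("pages.db")
conn.execute("CREATE TABLE IF NOT EXISTS pages (html TEXT)")
conn.execute("INSERT INTO pages (html) VALUES (?)", (my_unicoded_html,))
conn.commit()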
I have a Python WSGI script that attempts to extract RSS items that are posted to it and store the RSS in a sqlite3 db. I am using flup as the WSGIServer.
To obtain the posted content:
postData = environ["wsgi.input"].read(int(environ["CONTENT_LENGTH"]))
To attempt to store in the db:
from pysqlite2 import dbapi2 as sqlite
ldb = sqlite.connect("/var/vhost/mysite.com/db/rssharvested.db")
lcursor = ldb.cursor()
lcursor.execute("INSERT into rss(data) VALUES(?)", (postData,))
This results in only the first few characters of the rss being stored in the record:
ÿþ<
I believe the initial characters are the BOM of the RSS.
I have tried every permutation I could think of, including first encoding the RSS as utf-8 and then attempting to store it, but the results were the same. I could not decode it because some characters could not be represented as unicode.
Running Python 2.5.2
SQLite 3.5.7
Thanks in advance for any insight into this problem.
Here is a sample of the initial data contained in postData as modified by the repr function, written to a file and viewed with less:
'\xef\xbb\xbf
Thanks for the all the replies! Very helpful.
The sample I submitted didn't make it through the Stack Overflow HTML filters; I'll try again, converting less-than and greater-than signs to entities (the preview indicates this works).
\xef\xbb\xbf<?xml version="1.0" encoding="utf-16"?><rss xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><channel><item d3p1:size="0" xsi:type="tFileItem" xmlns:d3p1="http://htinc.com/opensearch-ex/1.0/">
Regarding the insertion encoding - in any decent database API, you should insert unicode strings and unicode strings only.
For the reading and parsing bit, I'd recommend Mark Pilgrim's Feed Parser. It properly handles BOMs, and the license allows commercial use. This may be a bit too heavy-handed if you are not doing any actual parsing on the RSS data.
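A minimal sketch of that approach, assuming Python 2 with the feedparser package installed; which fields each entry carries depends on the actual feed:
import feedparser

feed = feedparser.parse(postData)        # copes with the BOM and the declared encoding
for entry in feed.entries:
    print repr(entry.get("title", u""))  # values come back as unicode objects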
Are you sure your incoming data are encoded as UTF-16 (otherwise known as UCS-2)?
UTF-16 encoded unicode strings typically include lots of NUL bytes (certainly for all characters that also exist in ASCII), so UTF-16 data can hardly be stored in environment variables (env vars in POSIX are NUL-terminated).
Please provide samples of the postData variable contents. Output them using repr().
Until then, the solid advice is: in all DB interactions, your strings on the Python side should be unicode strings; the DB interface should take care of all translations/encodings/decodings necessary.
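Following that advice, a minimal sketch (Python 2) of decoding postData to unicode before the INSERT; it assumes the payload is either UTF-8 or UTF-16 with a leading BOM, which is what the repr() samples suggest, and reuses lcursor from the question:
import codecs

if postData.startswith(codecs.BOM_UTF8):
    text = postData.decode("utf-8-sig")                         # strips the BOM as well
elif postData.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
    text = postData.decode("utf-16")                            # the BOM selects the byte order
else:
    text = postData.decode("utf-8")

lcursor.execute("INSERT INTO rss(data) VALUES(?)", (text,))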
Before the SQL insertion, you should convert the string to a unicode string. If that raises a UnicodeError exception, then encode the string with string.encode("utf-8").
Or you can autodetect the encoding and decode the string according to the detected scheme: Auto detect encoding
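A minimal sketch of the autodetection route, assuming the chardet package is installed; raw_bytes stands for whatever byte string you read, and since chardet only returns a guess, keep a fallback:
import chardet

guess = chardet.detect(raw_bytes)   # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
text = raw_bytes.decode(guess["encoding"] or "utf-8", "replace")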