Curious about unicode / string encoding in Python 3

I'd like to ask why something works which I have found after painful hours of reading/trying to understand and, in the end, simply successful trial-and-error...
I'm on Linux (Ubuntu 13.04, German time formats etc., but English system language). My small Python 3 script connects to a sqlite3 database of the reference manager Zotero. There I read a couple of keys with the goal of exporting files from the Zotero storage directory (probably not important, and as said above, I got it working).
All of this works fine with characters in the ASCII set, but of course there are a lot of international authors in the database, and my code used to fail on non-ASCII author names/paper titles.
Perhaps first some info about the database on command line sqlite3:
sqlite3 zotero-test.sqlite
SQLite version 3.7.15.2 2013-01-09 11:53:05
sqlite> PRAGMA encoding;
UTF-8
An example of a problematic entry:
sqlite> select * from itemattachments;
317|281|1|application/pdf|5|storage:MÃ¼ller-Forell_2008_Orbitatumoren.pdf||2|1372357574000|2814ef3ea9c50cce2c32d6fb46b977bb
The correct name would be "storage:Müller-Forell"; Zotero itself decodes this correctly, but SQLite does not (at least it does not output it correctly in my terminal).
Google tells me that "Ã¼" is what you get when a UTF-8 encoded "ü" is incorrectly decoded (or not decoded at all) as Latin-1/ISO-8859-1.
Reading this database entry from Python 3 with
import sqlite3

connection = sqlite3.connect("zotero-test.sqlite")
cursor = connection.cursor()
cursor.execute("SELECT itemattachments.itemID,itemattachments.sourceItemID,itemattachments.path,items.key FROM itemattachments,items WHERE mimetype=\"application/pdf\" AND items.itemID=itemattachments.itemID")
for pdf_result in cursor:
    print(pdf_result[2])
    print()
    print(pdf_result[2].encode("latin-1").decode("utf-8"))
gives:
storage:MÃ¼ller-Forell_2008_Orbitatumoren.pdf
storage:Müller-Forell_2008_Orbitatumoren.pdf
The second line is correct, so I got my script working (gosh, how many hours this cost me...).
Can somebody explain to me what this construction of .encode and .decode does? Which one is even executed first?
Thanks for any clues,
Joost

The cursor yields str objects. We run encode() on each one to convert it to a bytes object, and then decode those bytes back into a str. It sounds like the data in the database is misencoded.

What you're seeing here is UTF-8 data that was stored in the SQLite database as if it were Latin-1.
The sqlite3 module always returns unicode strings, so you first have to encode them back to bytes using Latin-1 (which reverses the incorrect decoding byte for byte) and then decode those bytes as UTF-8.
They shouldn't have been stored in the db as Latin-1 to begin with.
You are executing encode before decode: encode("latin-1") runs first and yields bytes; decode("utf-8") then turns those bytes into the correct string.
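To make the mechanics concrete, here is a minimal, self-contained sketch of that round trip (the string literal stands in for what the cursor returns):
mojibake = "storage:MÃ¼ller-Forell_2008_Orbitatumoren.pdf"
raw = mojibake.encode("latin-1")  # step 1: recover the original bytes unchanged
print(raw[8:12])                  # b'M\xc3\xbcl' -- \xc3\xbc is the UTF-8 encoding of "ü"
fixed = raw.decode("utf-8")       # step 2: decode those bytes the right way
print(fixed)                      # storage:Müller-Forell_2008_Orbitatumoren.pdf
The trick only works because Latin-1 maps the 256 byte values one-to-one onto the first 256 code points, so encoding with it returns exactly the bytes that were misread in the first place.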

Related

How to encode international strings with emoticons and special characters for storing in database

I want to use an API from a game and store the player and clan names in a local database. The names can contain all sorts of characters and emoticons. Here are just a few examples I found:
⭐💎
яαℓαηι
نکل
窝猫
鐵擊道遊隊
❤✖❤♠️♦️♣️✖
I use Python to read the API and write the data into a MySQL database. After that, I want to use the names in a Node.js web application.
What is the best way to encode those characters, and how can I safely store them in the database so that I can display them correctly afterwards?
I tried to encode the strings in Python with utf-8:
>>> sample = '蛙喜鄉民CLUB'
>>> sample
'蛙喜鄉民CLUB'
>>> sample = sample.encode('UTF-8')
>>> sample
b'\xe8\x9b\x99\xe5\x96\x9c\xe9\x84\x89\xe6\xb0\x91CLUB'
and stored the encoded string in a MySQL database with the utf8mb4_unicode_ci collation.
When I store the string from above and select it inside MySQL Workbench, it is displayed like this:
蛙喜鄉民CLUB
When I read this string from the database again in Python (and store it in db_str) I get:
>>> db_str
èåéæ°CLUB
>>> db_str.encode('UTF-8')
b'\xc3\xa8\xc2\x9b\xc2\x99\xc3\xa5\xc2\x96\xc2\x9c\xc3\xa9\xc2\x84\xc2\x89\xc3\xa6\xc2\xb0\xc2\x91CLUB'
The first output is total gibberish; the second one with utf-8 looks mostly like the encoded string from above, but with \xc2 or \xc3 inserted before each original byte.
How can I save such strings into mysql, so that I can read them again and display them correctly inside a python script?
Is my database collation utf8mb4_unicode_ci not suitable for such content? Or do I have to use another encoding?
As described by @abarnert in a comment to the question, the problem was that the library used for writing the unicode strings didn't know that UTF-8 should be used and therefore encoded the strings wrongly.
After adding charset='utf8mb4' as a parameter to the MySQL connection, the strings get written correctly in the intended encoding.
All I had to change was
conn = MySQLdb.connect(host, user, pass, db, port)
to
conn = MySQLdb.connect(host, user, pass, db, port, charset='utf8mb4')
and after that my approach described in the question worked flawlessly.
Edit: after declaring the charset='utf8mb4' parameter on the connection object, it is no longer necessary to encode the strings manually, as that is now done by the mysqlclient library itself.
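For illustration, a minimal end-to-end sketch of that setup; the credentials, the game database, and the players table are placeholders, not names from the question:
import MySQLdb  # i.e. the mysqlclient package

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                       db="game", charset="utf8mb4")
cur = conn.cursor()
# Pass the plain (unicode) string; the driver encodes it as utf8mb4 on the wire
cur.execute("INSERT INTO players (name) VALUES (%s)", (u"蛙喜鄉民CLUB",))
conn.commit()
cur.execute("SELECT name FROM players")
print(cur.fetchone()[0])  # -> 蛙喜鄉民CLUB, no manual encode()/decode() needed
Note that the players table itself must also be declared with CHARACTER SET utf8mb4, or MySQL will still reject astral characters such as emoji.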

Python 2.7 - MySQL utf8 encoding

When calling MySQL from Python, I prepare the connection with "SET NAMES 'utf8'", but still something is not right. I get a sequence like this:
å½å®¶1级è¯ä¹¦
when I am supposed to get Chinese characters, which elsewhere are always handled fine by utf8.
When I look at the utf8 code/sequence it clearly doesn't match the real one. Same sort of format, but different numbers.
Is this erroneous encoding on Python 2.7's end or bad programming on my end? I know Python 3.x has solved these issues but I cannot use the modules I want in later versions.
I know Python 2.7 can actually display Chinese by using the print statement, but otherwise the string is stored and viewed as a raw byte sequence. Look:
>>> '你好'
'\xc4\xe3\xba\xc3'
>>> print '\xc4\xe3\xba\xc3'
你好
OK... It seems adding
"SET NAMES 'gbk'"
before the MySQL SELECT query did the trick. Now at least the strings from my dictionary and from the SQL database can be compared. It also seems that GBK is often the preferred character encoding in China, which also explains the '\xc4\xe3\xba\xc3' bytes above: that is the GBK encoding of '你好', not UTF-8.
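A minimal Python 2.7 sketch of that fix, with placeholder credentials and a hypothetical table; note that MySQLdb.connect() also accepts charset='gbk' directly, which is the tidier equivalent of issuing SET NAMES yourself:
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
cur = conn.cursor()
cur.execute("SET NAMES 'gbk'")              # connection now sends/receives GBK bytes
cur.execute("SELECT cert_name FROM certs")  # hypothetical table/column
print cur.fetchone()[0]                     # GBK bytes; display correctly on a GBK terminal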

saving unicode to database

I am trying to save the following string to a MySQL database using Django (I got the string from somewhere else):
m.cr1 = u"\U0001F3C9" # cr1 is models.CharField(max_length=50)
m.save()
I get the error
Warning: Incorrect string value: '\xF0\x9F\x8F\x89' for column 'cr1' at row 1
I've looked at other related questions here and changed MySQL to utf8_unicode_ci, but this does not help. In general, my code works OK with unicode, but not in this specific case.
I guess that this is related to the fact that this is a 32-bit (non-BMP) code point.
I actually just want to detect this case, and maybe ignore the bad characters.
Any ideas?
Thanks
The MySQL utf8 is not the real UTF-8, but a modified one that only supports code points up to 0xFFFF. You are trying to use a code point (0x1F3C9 > 0xFFFF) that is not included in MySQL utf8.
You need a relatively new version of MySQL, and you need to change utf8 to utf8mb4. Everywhere.
The connection needs to be utf8mb4, as do the collation and the tables/columns. Anywhere you have utf8 in a MySQL context is wrong and needs to be utf8mb4.
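If switching everything to utf8mb4 is not an option and you only want to detect or drop the offending characters (as the question suggests), here is a minimal sketch, assuming Python 3 or a wide Python 2 build (on narrow builds, non-BMP characters are stored as surrogate pairs and this ord() test will not see them):
def has_non_bmp(text):
    # True if any code point needs more than 16 bits (MySQL utf8 can't store it)
    return any(ord(ch) > 0xFFFF for ch in text)

def strip_non_bmp(text):
    # Keep only characters inside the Basic Multilingual Plane (<= 0xFFFF)
    return u"".join(ch for ch in text if ord(ch) <= 0xFFFF)

print(has_non_bmp(u"\U0001F3C9"))           # True
print(strip_non_bmp(u"rugby \U0001F3C9!"))  # rugby !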

Storing long string of HTML in SQLite database causes unknown error

I am storing some HTML in an SQLite3 database in Python.
When I go to insert some HTML into my SQL table, I get an error, and I don't understand what's wrong or, more importantly, how to fix it.
Error string:
Exception General: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
The HTML string I am inserting into the table is pretty long (about 700 characters long).
Any idea whats wrong & how I can fix this?
Looking at the answer to this question, it looks like your issue is that you are attempting to insert HTML containing characters that do not map to ASCII. If you call unicode(my_problematic_html) you'll probably wind up with a UnicodeDecodeError. In that case you'll want to decode your problematic string to unicode by calling:
my_unicoded_html = my_problematic_html.decode("utf-8")
and then writing my_unicoded_html to the database.
You'll want to read Unicode In Python Completely Demystified.
* Please note, your HTML may be encoded in some codec (charset) other than utf-8. latin-1 is also a good guess if you are on Windows (or if the HTML might be from a Windows machine).
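A minimal Python 2 sketch of that decode-then-insert flow; the in-memory database, the pages table, and the byte string are placeholders, and utf-8 is only the assumed source encoding (see the note above):
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database file
conn.execute("CREATE TABLE pages (html TEXT)")

problematic_html = "<p>caf\xc3\xa9</p>"              # 8-bit bytestring with non-ASCII bytes
my_unicoded_html = problematic_html.decode("utf-8")  # now a unicode object
conn.execute("INSERT INTO pages (html) VALUES (?)", (my_unicoded_html,))
conn.commit()
print(conn.execute("SELECT html FROM pages").fetchone()[0])  # <p>café</p>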

What is the correct procedure to store a utf-16 encoded rss stream into sqlite3 using python

I have a Python WSGI script that attempts to extract RSS items that are posted to it and store the RSS in a sqlite3 db. I am using flup as the WSGIServer.
To obtain the posted content:
postData = environ["wsgi.input"].read(int(environ["CONTENT_LENGTH"]))
To attempt to store in the db:
from pysqlite2 import dbapi2 as sqlite
ldb = sqlite.connect("/var/vhost/mysite.com/db/rssharvested.db")
lcursor = ldb.cursor()
lcursor.execute("INSERT into rss(data) VALUES(?)", (postData,))
This results in only the first few characters of the rss being stored in the record:
ÿþ<
I believe the initial chars are the BOM of the rss.
I have tried every permutation I could think of, including first encoding the RSS as utf-8 and then attempting to store it, but the results were the same. I could not decode it because some characters could not be represented as unicode.
Running python 2.5.2
sqlite 3.5.7
Thanks in advance for any insight into this problem.
Here is a sample of the initial data contained in postData as modified by the repr function, written to a file and viewed with less:
'\xef\xbb\xbf
Thanks for the all the replies! Very helpful.
The sample I submitted didn't make it through the Stack Overflow HTML filters; I will try again, converting less-than and greater-than signs to entities (preview indicates this works):
\xef\xbb\xbf<?xml version="1.0" encoding="utf-16"?><rss xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><channel><item d3p1:size="0" xsi:type="tFileItem" xmlns:d3p1="http://htinc.com/opensearch-ex/1.0/">
Regarding the insertion encoding - in any decent database API, you should insert unicode strings and unicode strings only.
For the reading and parsing bit, I'd recommend Mark Pilgrim's Feed Parser. It properly handles BOMs, and the license allows commercial use. This may be a bit too heavy-handed if you are not doing any actual parsing of the RSS data.
Are you sure your incoming data are encoded as UTF-16 (otherwise known as UCS-2)?
UTF-16 encoded strings typically include lots of NUL bytes (certainly for all characters that also exist in ASCII), so UTF-16 data can hardly be stored in environment variables (env vars in POSIX are NUL-terminated).
Please provide samples of the postData variable contents. Output them using repr().
Until then, the solid advice is: in all DB interactions, your strings on the Python side should be unicode strings; the DB interface should take care of all translations/encodings/decodings necessary.
Before the SQL insertion you should convert the value to a unicode string; if doing so raises a UnicodeError exception, decode the byte string explicitly, e.g. postData.decode("utf-8").
Or you can auto-detect the encoding (for example with a library such as chardet) and decode using the detected scheme.
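Pulling that advice together, a minimal Python 2 sketch: decode the raw POST body to unicode first (the BOM tells .decode("utf-16") the byte order), then hand the unicode string to sqlite. The byte literal and in-memory database are placeholders:
import sqlite3  # Python 2.5's pysqlite2 exposes the same API

# Stand-in for the real POST body: UTF-16-LE bytes with a leading BOM (\xff\xfe)
postData = '\xff\xfe<\x00r\x00s\x00s\x00/\x00>\x00'
text = postData.decode("utf-16")  # the BOM selects the byte order and is stripped
ldb = sqlite3.connect(":memory:")
ldb.execute("CREATE TABLE rss (data TEXT)")
ldb.execute("INSERT INTO rss(data) VALUES(?)", (text,))
print(ldb.execute("SELECT data FROM rss").fetchone()[0])  # <rss/>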
