Python: convert Unicode escapes to readable characters

I am using Python 2.7 and psycopg2 to connect to PostgreSQL.
I read a bunch of data from a source that has strings like 'Aéropostale'. I then store it in the database. However, in PostgreSQL it ends up as 'A\u00e9ropostale', but I want it stored as 'Aéropostale'.
The encoding of the PostgreSQL database is UTF-8.
Please tell me how I can store the actual string 'Aéropostale' instead.
I suspect that the problem is happening in Python. Please advise.
EDIT:
Here is my data source
response_json = json.loads(response.json())
response is obtained via a service call and looks like:
print(type(response.json()))
>> <type 'str'>
print(response.json())
>> {"NameRecommendation": ["ValueRecommendation": [{"Value": "\"Handmade\""}, { "Value": "Abercrombie & Fitch"}, {"Value": "A\u00e9ropostale"}, {"Value": "Ann Taylor"}}]
From the above data, my goal is to construct a list of all ValueRecommendation.Value entries and store it in a PostgreSQL json datatype column. So the Python equivalent of the list I want to store is:
py_list = ["Handmade", "Abercrombie & Fitch", "A\u00e9ropostale", "Ann Taylor"]
Then I convert py_list into its JSON representation using json.dumps():
json_py_list = json.dumps(py_list)
And finally, to insert, I use psycopg2.cursor() and mogrify()
conn = psycopg2.connect("connectionString")
cursor = conn.cursor()
cursor.execute(cursor.mogrify("INSERT INTO table (columnName) VALUES (%s)", (json_py_list,)))
As I mentioned earlier, using the above logic, strings with special characters like é are getting stored as the escaped \u character code instead of the actual character.
Please spot my mistake.

json.dumps escapes non-ASCII characters by default so its output can work in non-Unicode-safe environments. You can turn this off with:
json_py_list = json.dumps(py_list, ensure_ascii=False)
Now you will get UTF-8-encoded output (unless you change that too with the encoding= argument), so you'll need to make sure your database connection is using that encoding.
In general it shouldn't make any difference, as both forms are valid JSON, and even with ensure_ascii off there are still characters that get \u-escaped.
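For reference, here is a minimal end-to-end sketch of the insert with ensure_ascii=False (table_name and columnName stand in for the real table and column, and the connection string is a placeholder):
import json
import psycopg2

py_list = [u"Handmade", u"Abercrombie & Fitch", u"A\u00e9ropostale", u"Ann Taylor"]
# Keep the actual characters instead of \uXXXX escapes.
json_py_list = json.dumps(py_list, ensure_ascii=False)

conn = psycopg2.connect("connectionString")  # placeholder connection string
conn.set_client_encoding('UTF8')             # make sure the session talks UTF-8
cursor = conn.cursor()
cursor.execute("INSERT INTO table_name (columnName) VALUES (%s)", (json_py_list,))
conn.commit()
Parameter binding (the (json_py_list,) tuple) also removes the need for mogrify() here.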

Related

Best practices when inserting a json variable into a MySQL table column of type json, using Python's pymysql library

I have a Python script that uses PyMySQL to connect to a MySQL database and insert rows there. Some of the columns in the database table are of type json.
I know that in order to insert a JSON value, we can run something like:
my_json = {"key" : "value"}
cursor = connection.cursor()
cursor.execute(insert_query)
"""INSERT INTO my_table (my_json_column) VALUES ('%s')""" % (json.dumps(my_json))
connection.commit()
The problem in my case is that the JSON is a variable over which I do not have much control (it's coming from an API call to a third-party endpoint), so my script keeps throwing new errors for JSON values it cannot handle.
For example, the JSON could very well contain a stringified JSON as a value, so my_json would look like:
{"key": "{\"key_str\":\"val_str\"}"}
→ In this case, running the usual insert script would throw a [ERROR] OperationalError: (3140, 'Invalid JSON text: "Missing a comma or \'}\' after an object member." at position 1234 in value for column \'my_table.my_json_column\'.')
Or another example are json variables that contain a single quotation mark in some of the values, something like:
{"key" : "Here goes my value with a ' quotation mark"}
→ In this case, the usual insert script returns an error similar to the one below, unless I manually escape those single quotation marks in the script by replacing them.
[ERROR] ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'key': 'Here goes my value with a ' quotation mark' at line 1")
So my question is the following:
Are there any best practices that I might be missing, and that I can use to keep my script from breaking in the two scenarios mentioned above, as well as with any other JSON values that might break the insert query?
I read some existing posts like this one or this one, where it's recommended to insert the JSON into a string or blob column, but I'm not sure whether that's good practice, or whether other issues (like string length limitations, for example) might arise from using a string column instead of json.
Thanks!
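Not part of the original post, but a sketch of how a parameterized query could sidestep both failure modes, assuming PyMySQL and the my_table/my_json_column names above (connection details are placeholders): the driver escapes quotes and backslashes itself, so neither the stringified JSON nor the single quotation mark needs manual treatment.
import json
import pymysql

connection = pymysql.connect(host="localhost", user="user", password="password",
                             database="mydb", charset="utf8mb4")  # placeholder credentials

my_json = {"key": "{\"key_str\":\"val_str\"}",
           "other": "Here goes my value with a ' quotation mark"}

with connection.cursor() as cursor:
    # %s is a query parameter here, not Python %-formatting.
    cursor.execute("INSERT INTO my_table (my_json_column) VALUES (%s)",
                   (json.dumps(my_json),))
connection.commit()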

Trying to save special characters in MySQL DB

I have a string that looks like this 🔴Use O Mozilla Que Não Trava! Testei! $vip ou $apoio
When I try to save it to my database with ...SET description = %s... and cursor.execute(sql, description) it gives me an error
Warning: (1366, "Incorrect string value: '\xF0\x9F\x94\xB4Us...' for column 'description' ...
Assuming this is an ASCII symbol, I tried description.decode('ascii') but this leads to
'str' object has no attribute 'decode'
How can I determine what encoding it is and how could I store anything like that to the database? The database is utf-8 encoded if that is important.
I am using Python3 and PyMySQL.
Any hints appreciated!
First, you need to make sure the table column has the correct character set. If it is "latin1" you will not be able to store content that contains Unicode characters; for 4-byte characters such as the 🔴 emoji, the column needs "utf8mb4" (MySQL's plain "utf8" only covers 3-byte characters).
You can use the following query to determine the column character set:
SELECT CHARACTER_SET_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA='your_database_name' AND TABLE_NAME='your_table_name' AND COLUMN_NAME='description'
Follow the MySQL documentation here if you want to change the column character set.
Also, you need to make sure the character set is properly configured for the MySQL connection. Quoted from the MySQL docs:
Character set issues affect not only data storage, but also communication between client programs and the MySQL server. If you want the client program to communicate with the server using a character set different from the default, you'll need to indicate which one. For example, to use the utf8 Unicode character set, issue this statement after connecting to the server:
SET NAMES 'utf8';
Once the character set settings are correct, you will be able to execute your SQL statement. There is no need to encode/decode on the Python side; that is used for different purposes.
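A minimal sketch of both fixes, assuming PyMySQL and a hypothetical posts.description column; note that the 🔴 emoji is a 4-byte character, so utf8mb4 (rather than plain utf8) is the safe choice for both the column and the connection:
import pymysql

# Ask for utf8mb4 on the connection so 4-byte characters survive the round trip.
connection = pymysql.connect(host="localhost", user="user", password="password",
                             database="mydb", charset="utf8mb4")  # placeholder credentials

description = "\U0001F534Use O Mozilla Que N\u00e3o Trava! Testei! $vip ou $apoio"

with connection.cursor() as cursor:
    # One-time schema change (normally done in a migration, shown here for completeness).
    cursor.execute("ALTER TABLE posts MODIFY description TEXT "
                   "CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci")
    cursor.execute("INSERT INTO posts (description) VALUES (%s)", (description,))
connection.commit()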

Python MySQL insert and retrieve a list in Blob

I'm trying to insert a list of elements into a MySQL database (into a Blob column). This is an example of my code:
myList = [1345,22,3,4,5]
myListString = str(myList)
myQuery = 'INSERT INTO table (blobData) VALUES (%s)'
cursor.execute(myQuery, (myListString,))
Everything works fine and my list is stored in the database. But when I want to retrieve it, because it's now a string, I have no idea how to get a real integer list back instead of a string.
For example, if I now do:
myQuery = 'SELECT blobData FROM db.table'
cursor.execute(myQuery)
myRetrievedList = cursor.fetchall()
print myRetrievedList[0]
I'll get:
[
instead of :
1345
Is there any way to transform my string [1345,22,3,4,5] back into a list?
You have to pick a data format for your list. Common solutions, in order of my preference, are:
json -- fast, readable, allows nested data, very useful if your table is ever used by any other system; checks whether the blob is in a valid format. Use json.dumps() and json.loads() to convert to and from the string/blob representation.
repr() -- fast, readable, works across Python versions; unsafe if someone gets into your DB. Use repr() and eval() to get data to and from string/blob format.
pickle -- fast, unreadable, does not work across multiple architectures (afaik); does not check whether the blob is truncated. Use cPickle.dumps(..., protocol=cPickle.HIGHEST_PROTOCOL) and cPickle.loads(...) to convert your data.
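For reference, a quick round-trip sketch of the three formats (Python 2, since the answer mentions cPickle; variable names are illustrative):
import json
import cPickle

myList = [1345, 22, 3, 4, 5]

# json: readable and portable
blob = json.dumps(myList)
assert json.loads(blob) == myList

# repr/eval: readable, but eval is unsafe on untrusted data
blob = repr(myList)
assert eval(blob) == myList

# pickle: compact, but opaque and Python-specific
blob = cPickle.dumps(myList, protocol=cPickle.HIGHEST_PROTOCOL)
assert cPickle.loads(blob) == myList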
As per the comments in this answer, the OP has a list of lists being entered as the blob field. In that case, JSON seems the better way to go.
import json
...
...
myRetrievedList = cursor.fetchall()
jsonOfBlob = json.loads(myRetrievedList[0][0])  # the blob column of the first row
integerListOfLists = []
for oneList in jsonOfBlob:
    listOfInts = [int(x) for x in oneList]
    integerListOfLists.append(listOfInts)
return integerListOfLists  # or print, or whatever
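For completeness, a sketch of the matching write side under the same assumptions (PyMySQL, placeholder connection details, and table_name standing in for the real table):
import json
import pymysql

connection = pymysql.connect(host="localhost", user="user", password="password",
                             database="mydb")  # placeholder credentials
cursor = connection.cursor()

myList = [[1345, 22, 3], [4, 5]]  # a list of lists, as described in the comments
cursor.execute('INSERT INTO table_name (blobData) VALUES (%s)', (json.dumps(myList),))
connection.commit()

# Reading it back:
cursor.execute('SELECT blobData FROM table_name')
restored = json.loads(cursor.fetchone()[0])
print(restored)  # [[1345, 22, 3], [4, 5]]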

Unable to convert PostgreSQL text column to bytea

In my application I am using a PostgreSQL database table with a "text" column to store pickled Python objects.
As the database driver I'm using psycopg2, and until now I only passed Python strings (not unicode objects) to the DB and retrieved strings from it. This basically worked fine until I recently decided to handle strings the better/correct way and added the following construct to my DB layer:
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
This basically works fine everywhere in my application and I'm using unicode-objects where possible now.
But for this special case, with the text column containing the pickled objects, it causes trouble. I got it working on my test system this way:
retrieving the data:
SELECT data::bytea, params FROM mytable
writing the data:
execute("UPDATE mytable SET data=%s", (psycopg2.Binary(cPickle.dumps(x)),) )
... but unfortunately I'm getting errors with the SELECT for some columns in the production system:
psycopg2.DataError: invalid input syntax for type bytea
This error also happens when I try to run the query in the psql shell.
Basically I'm planning to convert the column from "text" to "bytea", but the error above also prevents me from doing this conversion.
As far as I can see, (when retrieving the column as pure python string) there are only characters with ord(c)<=127 in the string.
The problem is that casting text to bytea doesn't mean "take the bytes in the string and assemble them as a bytea value"; instead it takes the string and interprets it as an escaped input value for the bytea type. So that won't work, mainly because pickle data contains lots of backslashes, which bytea interprets specially.
Try this instead:
SELECT convert_to(data, 'LATIN1') ...
This converts the string into a byte sequence (bytea value) in the LATIN1 encoding. For you, the exact encoding doesn't matter, because it's all ASCII (but there is no ASCII encoding).
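A sketch of how this could look from psycopg2 (the table and column names come from the question; the ALTER TABLE statement is an assumption about how the planned text-to-bytea conversion might be written):
import cPickle
import psycopg2

conn = psycopg2.connect("connectionString")  # placeholder connection string
cur = conn.cursor()

# Read a pickled object back: convert_to() yields a bytea from the text column.
cur.execute("SELECT convert_to(data, 'LATIN1'), params FROM mytable")
blob, params = cur.fetchone()
obj = cPickle.loads(str(blob))

# Planned one-time migration of the column from text to bytea:
cur.execute("ALTER TABLE mytable ALTER COLUMN data TYPE bytea "
            "USING convert_to(data, 'LATIN1')")
conn.commit()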

Python: urllib.urlencode is escaping my stuff *twice*

... but it's not escaping it the same way twice.
I'm trying to upload ASCII output from gpg to a website. So the bit I've got so far just queries the table, shows me the data it got, and then shows it to me again after encoding it for an HTTP POST request:
cnx = connect()
sql = ("SELECT Data FROM SomeTable")
cursor = cnx.cursor()
cursor.execute(sql)
for (data) in cursor:
print "encoding : %s" % data
postdata = urllib.urlencode( { "payload" : data } )
print "encoded as %s" % postdata
... but what I get is:
encoding : -----BEGIN PGP MESSAGE-----
Version: GnuPG v1.4.12 (GNU/Linux)
.... etc...
encoded as payload=%28u%27-----BEGIN+PGP+MESSAGE-----%5CnVersion%3A+GnuPG+v1.4.12+... etc ...
The part to notice is that the newlines aren't getting turned into %0A, like I'd expect. Instead, they're somehow getting escaped into "\n", and then the backslashes are escaped to %5C, so a newline becomes "%5Cn". Even stranger, the data gets prepended with %28u%27, which comes out to "(u'".
Oddly, if I just do a basic test with:
data = "1\n2"
print data
print urllib.urlencode( { "payload" : data } )
I get what I expect, newlines turn into %0A...
1
2
payload=1%0A2
So, my hunch is that the data element returned from the mysql query isn't the same kind of string as my literal "1\n2" (maybe a 1-element dict... dunno), but I don't have the Python kung-fu to know how to inspect it.
Anybody know what's going on, here, and how I can fix it? If not, any suggestions for how to POST this via HTTP with everything getting escaped properly?
Assuming connect() is a function from some DB-API 2.0 compatible database interface (like the built-in sqlite3, or the most popular mysql interface), for (data) in cursor: is iterating Row objects, not strings.
When you print it out, you're effectively printing str(data) (by passing it to a %s format). If you want to encode the same thing, you have to encode str(data).
However, a better way to do it is to handle the rows as rows (of one column) in the first place, instead of relying on str to do what you want.
PS, if you were trying to rely on tuple unpacking to make data the first element of each row, you're doing it wrong:
for (data) in cursor:
… is identical to:
for data in cursor:
If you want a one-element tuple, you need a comma:
for data, in cursor:
(You can also add the parens if you want, but they still don't make a difference either way.)
Specifically, iterating the cursor will call the optional __iter__ method, which returns the cursor itself, and then loop calling the next method on it, which does the same thing as calling fetchone() until the result set is exhausted. fetchone is documented to return "a single sequence", whose type isn't defined. In most implementations, that's a special row type, like sqlite3.Row, which can be accessed as if it were a tuple but has special semantics for things like printing in tabular format, by-name access, etc.
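A sketch of the fix implied above, assuming the same connect() helper from the question and Python 2's urllib:
import urllib

cnx = connect()  # DB-API connection factory from the question
cursor = cnx.cursor()
cursor.execute("SELECT Data FROM SomeTable")

for data, in cursor:  # note the comma: unpack each one-column row to a string
    print "encoding : %s" % data
    postdata = urllib.urlencode({"payload": data})
    print "encoded as %s" % postdata  # newlines now come out as %0A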
