Unicode Hell in Pyramid: MySQL -> SQLAlchemy -> Pyramid -> JSON

Unicode Hell in Pyramid: MySQL -> SQLAlchemy -> Pyramid -> JSON - python

Background
I'm in a real mess with unicode and Python. It seems to be a common angst and I've tried using other solutions out there but I just can't get my head around it.
Setup
MySQL Database Setup
collation_database: utf8_general_ci
character_set_database: utf8
SQLAlchemy Model
class Product(Base):
id = Column('product_id', Integer, primary_key=True)
name = Column('product_name', String(64)) #Tried using Unicode() but didn't help
Pyramid View
#view_config(renderer='json', route_name='products_search')
def products_search(request):
json_products = []
term = "%%%s%%" % request.params['term']
products = dbsession.query(Product).filter(Product.name.like(term)).all()
for prod in products:
json_prod = {'id': prod.id, 'label': prod.name, 'value': prod.name, 'sku': prod.sku, 'price': str(prod.price[0].price)}
json_products.append(json_prod)
return json_products
Problem
I get encoding errors reported from the json module (which is called as its the renderer for this route) like so:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 37: invalid start byte
The culprit is a "-" (dash symbol) in the prod.name value. Full stack trace here. If the returned products don't have a "-" in then it all works fine!
Tried
I've tried encoding, decoding with various types before returning the json_products variable.

The above comment is right, but more specifially, you can replace 'label': prod.name with 'label': prod.name.decode("cp1252"). You should probably also do this for all of the strings in your json_prod dictionary since it's likely you'll see cp1252 encoded characters elsewhere in the real-world use of your application.
On that note, depending on the source of these strings and how widely that source is used in your app, you may run into such problems elsewhere in your app and generally when you least expect it. To investigate further you may want to figure What the source for these strings is and if you can do the decoding/re-encoding at a lower level to correct most future problems with this.

Related

Python response object returns unexpected utf code points

I am attempting to retrieve a response in the form of an xml byte string from a web service using the Response library, parse it with an XSLT in memory, and write the output to Teradata using pyodbc. I am using python 3.
Any letter beyond the standard ascii 127 code point appears to come in as two characters. For example ü comes over as Ã¼ into the database.
’ comes over as â and two unprintable characters after it (PAD,and SGCI according to notepad++)
There is also the possibility Kanji characters in the response.
My code:
cnxn.pyodbc.connect('mydsn')
cnxn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
cnxn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
cnxn.setdecoding(pyodbc.SQL_WMETADATA, encoding='utf-8')
cnxn.setencoding(encoding='utf-8')
cursor = cnxn.cursor()
transform = etree.XSLT(etree.fromstring(xsltasstring))
p = etree.XML(response.content)
result_tree = transform(p)
linefeed = '\n'
recordseperator = ','
output_list = str(result_tree).split(linefeed)
insert_sql = 'insert into the_table(.....)'
for item in output_list:
temp_list = item.split(recordseperator)
final_list.append(temp_list)
print(item)
print(item.encode())
cursor.executemany(insert_sql,final_list)
When I look in the database, an example string: ação has been translated to aÃ§Ã£o
In the loop,
print(item) outputs correctly: ação`
print(final_list) also looks correct: ação
print(item.encode()) outputs b'a\xc3\xa7\xc3\xa3o'
c3,a7,c3,x3 corresponds to Ã§Ã£. I do not understand what is causing this, or how to fix it.
I would have expected to see something like: b'a\xe7\xe3o'
Based on an odbc trace I do see those code points c3,a7, and x3 are passed to the buffer.
The response from the web service does have the UTF-8 encoding declaration, and the web service vendor also says the response encoding is UTF-8.

Upgrading to the Teradata 16.20 drivers stopped this problem, but I never was able to figure out exactly why this was happening.

Python encoding errors

I try to scrape website and print meta fields. I got an UnicodeEncodeError but I resolved it by using chcp 65001 in my terminal (I am using Windows). Now it works fine but some sites gives me strange results. I get "Ä‡wiczenia" instead of "Ćwiczenia". Other sites gives proper values ("Ćwiczenia" for example).
Why one time it is ok and another time it is not?
This is my method:
def description(self):
description = self.soup.find_all(attrs={'name':
['description', 'Description']})
if description:
return(description[0]['content'])
It gives me good result on one page. On another
if description:
return(description[0]['content'].encode("windows-1252").decode("utf-8"))
fix it (I get proper encoding) but when I open previous site with this method I get an error:
"'charmap' codec can't encode character '\u015b' in position 69"
How can I solve this?

General Python Unicode / ASCII Casting Issue Causing Trouble in Pyorient

UPDATE: I opened an issue on github based on a Ivan Mainetti's suggestion. If you want to weigh in there, it is :https://github.com/orientechnologies/orientdb/issues/6757
I am working on a database based on OrienDB and using a python interface for it. I've had pretty good luck with it, but I've run into a problem that seems to be the driver's (pyorient) wonkiness when dealing with certain unicode characters.
The data structure I'm uploading to the database looks like this:
new_Node = {'#Nodes':
{
"Abs_Address":Ono.absolute_address,
'Content':Ono.content,
'Heading':Ono.heading,
'Type':Ono.type,
'Value':Ono.value
}
}
I have created literally hundreds of records flawlessly on OrientDB / pyorient. I don't think the problem is necessarily a pyorient specific question, however, as I think the reason it is failing on a particular record is because the Ono.absolute_address element has a unicode character that pyorient is somehow choking on.
The record I want to create has an Abs_address of /u/c/2/a1–2, but the node I get when I pass the value to the my data structure above is this:
{'#Nodes': {'Content': '', 'Abs_Address': u'/u/c/2/a1\u20132', 'Type': 'section', 'Heading': ' Transferred', 'Value': u'1\u20132'}}
I think that somehow my problem is python is mixing unicode and ascii strings / chars? I'm a bit new to python and not declaring types, so I'm hoping this isn't an issue with pyorient perse given that the new_Node object doesn't output the properly formatted string...? Or is this an instance of pyorient not liking unicode? I'm tearing my hair out on this one. Any help is appreciated.
In case the error is coming from pyorient and not some kind of text encoding, here's the pyorient-related info. I am creating the record using this code:
rec_position = self.pyo_client.record_create(14, new_Node)
And this is the error I'm getting:
com.orientechnologies.orient.core.storage.ORecordDuplicatedException - Cannot index record Nodes{Content:,Abs_Address:null,Type:section,Heading: Transferred,Value:null}: found duplicated key 'null' in index 'Nodes.Abs_Address' previously assigned to the record #14:558
The error is odd as it suggests that the backend database is getting a null object for the address. Apparently it did create an entry for this "address," but it's not what I want it to do. I don't know why address strings with unicode are coming up null in the database... I can create it through orientDB studio using the exact string I fed into the new_Node data structure... but I can't use python to do the same thing.
Someone help?
EDIT:
Thanks to Laurent, I've narrowed the problem down to something to do with unicode objects and pyorient. Whenever a the variable I am passing is type unicode, the pyorient adapter sends a null value to the OrientDB database. I determined the value that is causing the problem is an ndash symbol, and Laurent helped me replace it with a minus sign using this code
.replace(u"\u2013",u"-")
When I do that, however, pyorient gets unicode objects which it then passes as null values... This is not good. I can fix this short term by recasting the string using str(...) and this appears to solve my immediate problem:
str(Ono.absolute_address.replace(u"\u2013",u"-"))
. Problem is, I know I will have symbols and other unusual characters in my DB data. I know the database supports the unicode strings because I can add them manually or use SQL syntax to do what I cannot do via pyorient and python... I am assuming this is a dicey casting issue somewhere, but I'm not really sure where. This seems very similar to this problem: http://stackoverflow.duapp.com/questions/34757352/how-do-i-create-a-linked-record-in-orientdb-using-pyorient-library
Any pyorient people out there? Python gods? Lucky s0bs? =)

I have tried your example on Python 3 with the development branch of pyorient with the latest version of OrientDB 2.2.11. If I pass the values without escaping them, your example seems to work for me and I get the right values back.
So this test works:
def test_test1(self):
new_Node = {'#Nodes': {'Content': '',
'Abs_Address': '/u/c/2/a1–2',
'Type': 'section',
'Heading': ' Transferred',
'Value': u'1\u20132'}
}
self.client.record_create(14, new_Node)
result = self.client.query('SELECT * FROM V where Abs_Address="/u/c/2/a1–2"')
assert result[0].Abs_Address == '/u/c/2/a1–2'
I think you may be saving the unicode value as an escaped value and that's where things get tricky.
I don't trust replacing values myself so I usually escape the unicode values I send to orientdb with the following code:
import json
def _escape(string):
return json.dumps(string)[1:-1]
The following test would fail because the escaped value won't match the escaped value in the DB so no record will be returned:
def test_test2(self):
new_Node = {'#Nodes': {'Content': '',
'Abs_Address': _escape('/u/c/2/a1–2'),
'Type': 'section',
'Heading': ' Transferred',
'Value': u'1\u20132'}
}
self.client.record_create(14, new_Node)
result = self.client.query('SELECT * FROM V where Abs_Address="%s"' % _escape('/u/c/2/a1–2'))
assert result[0].Abs_Address.encode('UTF-8').decode('unicode_escape') == '/u/c/2/a1–2'
In order to fix this, you have to escape the value twice:
def test_test3(self):
new_Node = {'#Nodes': {'Content': '',
'Abs_Address': _escape('/u/c/2/a1–2'),
'Type': 'section',
'Heading': ' Transferred',
'Value': u'1\u20132'}
}
self.client.record_create(14, new_Node)
result = self.client.query('SELECT * FROM V where Abs_Address="%s"' % _escape(_escape('/u/c/2/a1–2')))
assert result[0].Abs_Address.encode('UTF-8').decode('unicode_escape') == '/u/c/2/a1–2'
This test will succeed because you will now be asking for the escaped value in the DB.

Testing request parameters in Django ("+" behaves differently)

I have a Django View that uses a query parameter to do some content filtering. Something like this:
/page/?filter=one+and+two
/page/?filter=one,or,two
I have noticed that Django converts the + to a space (request.GET.get('filter') returns one and two), and I´m OK with that. I just need to adjust the split() function I use in the View accordingly.
But...
When I try to test this View, and I call:
from django.test import Client
client = Client()
client.get('/page/', {'filter': 'one+and+two'})
request.GET.get('filter') returns one+and+two: with plus signs and no spaces. Why is this?
I would like to think that Client().get() mimics the browser behaviour, so what I would like to understand is why calling client.get('/page/', {'filter': 'one+and+two'}) is not like browsing to /page/?filter=one+and+two. For testing purposes it should be the same in my opinion, and in both cases the view should receive a consistent value for filter: be it with + or with spaces.
What I don´t get is why there are two different behaviours.

The plusses in a query string are the normal and correct encoding for spaces. This is a historical artifact; the form value encoding for URLs differs ever so slightly from encoding other elements in the URL.
Django is responsible for decoding the query string back to key-value pairs; that decoding includes decoding the URL percent encoding, where a + is decoded to a space.
When using the test client, you pass in unencoded data, so you'd use:
client.get('/page/', {'filter': 'one and two'})
This is then encoded to a query string for you, and subsequently decoded again when you try and access the parameters.

This is because the test client (actually, RequestFactory) runs django.utils.http.urlencode on your data, resulting in filter=one%2Band%2Btwo. Similarly, if you were to use {'filter': 'one and two'}, it would be converted to filter=one%20and%20two, and would come into your view with spaces.
If you really absolutely must have the pluses in your query string, I believe it may be possible to manually override the query string with something like: client.get('/page/', QUERY_STRING='filter=one+and+two'), but that just seems unnecessary and ugly in my opinion.

Save unicode in redis but fetch error

I'm using mongodb and redis, redis is my cache.
I'm caching mongodb objects with redis-py:
obj in mongodb: {u'name': u'match', u'section_title': u'\u6d3b\u52a8', u'title':
u'\u6bd4\u8d5b', u'section_id': 1, u'_id': ObjectId('4fb1ed859b10ed2041000001'), u'id': 1}
the obj fetched from redis with hgetall(key, obj) is:
{'name': 'match', 'title': '\xe6\xaf\x94\xe8\xb5\x9b', 'section_title':
'\xe6\xb4\xbb\xe5\x8a\xa8', 'section_id': '1', '_id': '4fb1ed859b10ed2041000001', 'id': '1'}
As you can see, obj fetched from cache is str instead of unicode, so in my app, there is error s like :'ascii' codec can't decode byte 0xe6 in position 12: ordinal not in range(128)
Can anyone give some suggestions? thank u

I think I've discovered the problem. After reading this, I had to explicitly decode from redis which is a pain, but works.
I stumbled across a blog post where the author's output was all unicode strings which was obv different to mine.
Looking into the StrictRedis.__init__ there is a parameter decode_responses which by default is False. https://github.com/andymccurdy/redis-py/blob/273a47e299a499ed0053b8b90966dc2124504983/redis/client.py#L446
Pass in decode_responses=True on construct and for me this FIXES THE OP'S ISSUE.

Update, for global setting, check jmoz's answer.
If you're using third-party lib such as django-redis, you may need to specify a customized ConnectionFactory:
class DecodeConnectionFactory(redis_cache.pool.ConnectionFactory):
def get_connection(self, params):
params['decode_responses'] = True
return super(DecodeConnectionFactory, self).get_connection(self, params)
Assuming you're using redis-py, you'd better to pass str instead of unicode to Redis, or else Redis will encode it automatically for *set commands, normally in UTF-8. For the *get commands, Redis has no idea about the formal type of a value and has to just return the value in str directly.
Thus, As Denis said, the way that you storing the object to Redis is critical. You need to transform the value to str to make the Redis layer transparent for you.
Also, set the default encoding to UTF-8 instead of using ascii

for each string you can use the decode function to transform it in utf-8, e.g. for the value if the title field in your code:
In [7]: a='\xe6\xaf\x94\xe8\xb5\x9b'
In [8]: a.decode('utf8')
Out[8]: u'\u6bd4\u8d5b'

I suggest you always encode to utf-8 before writing to MongoDB or Redis (or any external system). And that you decode('utf-8') when you fecth results, so that you always work with Unicode in Python.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.