GAE unicode character gets encoded to utf-8 bytes - python

In my app I'm accepting text from user inputs where users often paste text from microsoft word.
A good example is the apostrophe ’, which for some reason gets converted to =E2=80=99 when posted to my handler in Google App Engine. I've tried a number of confused ways to prevent this, and I'm quite happy to simply remove these characters; some of these methods work in plain Python but not in App Engine.
here's some of what I've tried:
problem_string = re.sub(r'[^\x00-\x7F]+', '', problem_string)  # trying to remove it
problem_string = problem_string.encode("utf-8")  # desperation...
problem_string = "".join(c if ord(c) < 128 else '' for c in problem_string)  # trying to just remove the thing
problem_string = unicode(problem_string, "utf8")  # probably fails since it's already unicode
... where I'm trying to capture the string including ’ and then later save it to the ndb datastore as a StringProperty(). Except with the last option, the apostrophe example still gets converted to =E2=80=99.
If I could save the apostrophe type character and display it again that would be great, but simply removing it would also serve my needs.
*Edit - the following:
experience = re.sub(r'[^\x00-\x7F]+',' ', experience)
seems to work fine on the dev server, and successfully removes the offending apostrophe.
Also, what may be an issue is that the POST fields are going through the blobstore, so: blobstore_handlers.BlobstoreUploadHandler, which I think may be causing some problems.
I've really been bumping my head against this and I would really really appreciate an explanation from some clever stack-overflower...

Ok, I think I've vaguely stumbled upon a solution.
It had something to do with the blobstore upload handler; I guess it was encoding/decoding unicode to account for weird file characters. So I modified the handler so that the image file is uploaded via Google Cloud Storage instead of the blobstore, and it seems to work fine, i.e. the ’ reaches the datastore as ’ instead of =E2=80=99.
I won't accept my own answer for the next few days, maybe someone can clarify things better for future confused individuals.
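For the record, =E2=80=99 is the quoted-printable encoding of the three UTF-8 bytes for ’ (U+2019), which is why it appears when a MIME-encoded POST body isn't decoded. If you ever need to recover the character by hand, the stdlib quopri module can do it (a sketch, not specific to App Engine):

```python
import quopri

# "=E2=80=99" is the quoted-printable form of the UTF-8 bytes for ’ (U+2019)
raw = b"It=E2=80=99s"
decoded = quopri.decodestring(raw).decode("utf-8")
print(decoded)  # It’s
```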

Related

Elasticsearch indexing with Python UTF-8 problems

I am indexing data in Elasticsearch using the official Python library for this: elasticsearch-py. The data is taken directly from Oracle using the cx_Oracle Python library, cast into a document format, and sent to Elasticsearch for indexing. For the most part this works great, but sometimes I encounter problems with characters like ö. Sometimes this character is indexed as \xc3\xb8 and sometimes as ö. This happens even within the same database entry: one variable can have the ö indexed correctly while another does not.
Does anyone have an idea what might cause this?
Thanks in advance.
If your "ö" is sometimes right and sometimes not, the data must be corrupted in your database; this is not a problem with Elasticsearch. (I had the exact same problem a month ago!)
Strings with various encodings were likely put into your database without all being converted to a single format first.
text = "ö"
asUtf = text.encode('UTF-8')
print(asUtf)
print(asUtf.decode())
Result:
b'\xc3\xb6'
ö
This could be fixed before insertion into Elasticsearch: find the text sequences matching '\xXX\xXX', treat them as UTF-8, and decode them to unicode. Try to sanitize your database and fix the way you put information into it.
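One way to attempt that repair in Python (a sketch: the Latin-1 round trip is an assumption about how the bytes were mis-decoded, so verify it against your actual data first):

```python
def repair_mojibake(text):
    # "Ã¶" is what the UTF-8 bytes for "ö" look like when mis-decoded as
    # Latin-1. Re-encoding as Latin-1 recovers the original bytes, which
    # we then decode as UTF-8; if that round trip fails, the text was
    # probably fine to begin with, so we return it unchanged.
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        return text

print(repair_mojibake("Ã¶"))  # ö
```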
PS: a better practice to move information from a database to Elasticsearch is to use rivers or to make a script that would directly send the data to Elasticsearch, without saving them into a file first.
2016 edit: the rivers are deprecated now, so you should find an alternative like logstash.

using unicode strings with white space as Django url variable

Is there a problem with using unicode (Hebrew, specifically) strings including white space?
Some of them also include characters such as "%".
I'm experiencing some problems and since this is my first Django project I want to rule out this as a problem before going further into debugging.
And if there is a known Django problem with this kind of URL, is there a way around it?
I know I can reformat the text to solve some of those problems, but since I'm preparing a site that uses raw open government data sets (perfectly legal), I would like to stay as close to the original format as possible.
thanks for the help
Django shouldn't have any problems with unicode URLs, or with whitespace in URLs for that matter (although you might want to take care to make sure whitespace is urlencoded as %20).
Either way, though, using white space in a URL is just bad form. It's not guaranteed to work unless it's urlencoded, and then that's just one more thing to worry about. Best to make any field that will eventually become part of a URL a SlugField (so spaces aren't allowed to begin with) or run the value through slugify before placing it in the URL:
In template:
http://domain.com/{{ some_string_with_spaces|slugify }}/
Or in python code:
from django.template.defaultfilters import slugify
u'http://domain.com/%s/' % slugify(some_string_with_spaces)
Take a look here for a fairly comprehensive discussion on what makes an invalid (or valid) URL.
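If you do need to keep the original text in the URL rather than slugify it, percent-encoding it explicitly handles both the whitespace and the literal "%" (a sketch using the Python 3 stdlib; on Python 2 the same functions live in urllib):

```python
from urllib.parse import quote, unquote

original = "data set 100%"   # a stand-in for the Hebrew strings in question;
                             # non-ASCII text is percent-encoded as UTF-8 bytes
encoded = quote(original, safe="")
print(encoded)                        # data%20set%20100%25
print(unquote(encoded) == original)   # True: the encoding round-trips
```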

Python 2.7, Appengine Data Store & Unicode

So I've been reading quite a bit about unicode tonight because I was thinking of switching to Jinja2, which requires unicode to be used everywhere within the app. I think I have a good idea of how to deal with it, but I wanted to hear whether this is reasonable before I start to code my app:
Dealing with External Text-Inputs (via html forms)
a) Make sure all html pages are utf-8 encoded.
b) Once users press submit, make sure the data is converted into unicode as soon as the Python backend receives it, e.g. self.request.get('stuff').decode('utf-8')
c) Stay in unicode, and hand the output to Jinja2, which will render it using the default encoding of utf-8.
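The steps above boil down to decoding once at the boundary and encoding only on the way out; a minimal sketch of that flow (the byte string is just an illustration):

```python
raw = b"caf\xc3\xa9"            # UTF-8 bytes as they arrive on the wire
text = raw.decode("utf-8")      # decode once, at the boundary: now u"café"
out = text.encode("utf-8")      # encode only when bytes leave the app
print(out == raw)               # True: the round trip is lossless
```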
Information from the appengine datastore
Because Google stores everything as unicode, all data coming in from the datastore is already unicode and I don't have to worry about anything (yay!)
Strings within the app
Make sure all string literals start with a u (i.e. u"hello world"); this will force everything to be unicode.
Well the above is my strategy to keep everything consistent. Is there anything else I need to account for?
thanks!
You should not need to decode self.request.get('stuff') at all if you're using webapp or webapp2. The framework respects the input type of the data as specified.
Everything else looks right.
Also I believe that
from __future__ import unicode_strings
should be
from __future__ import unicode_literals
and it is only available in Python 2.6 and 2.7, so on App Engine it is only available if you are using the 2.7 runtime.
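A quick illustration of what the future import does (on Python 2 it makes bare literals unicode; on Python 3 it is a harmless no-op):

```python
from __future__ import unicode_literals

# With the import in place, a plain literal equals its u-prefixed twin,
# so you no longer need to scatter u"..." throughout the module.
print("hello world" == u"hello world")  # True
```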

Robust way to put contents of any arbitrary text file in the database (using Django/Python)?

As part of my Django app, I have to get the contents of a text file which a user uploads (which could be in any charset) and save it to my DB. I keep running into issues (like having to remove UTF-8's BOM manually, having to figure out how to account for non-printable characters, having to figure out how to make all unicode characters work, not just Latin ones, etc.), and each of these issues requires its own hack.
Is there a robust way to do this that doesn't require each of these case-by-case fixes? Right now I'm just using file.read() to get the contents, then doing all of those workarounds to clean the contents, and then using .save() to save it to the DB (I have a model for this).
What else can I be doing?
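(On the BOM hack specifically: decoding with the utf-8-sig codec strips it for you; a sketch, assuming the input really is UTF-8:)

```python
import codecs

raw = codecs.BOM_UTF8 + b"hello"   # bytes as read from the uploaded file
text = raw.decode("utf-8-sig")     # "utf-8-sig" eats the BOM if present
print(text)  # hello
```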
It causes some overhead, but you could base64-encode the entire string before persisting it to the db. Then no escaping is required.
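A sketch of that approach, base64-encoding the raw file bytes so the DB column only ever sees ASCII:

```python
import base64

data = b"\xef\xbb\xbfcaf\xc3\xa9"   # e.g. file.read(): UTF-8 bytes with a BOM
stored = base64.b64encode(data).decode("ascii")   # safe for any text column
restored = base64.b64decode(stored)
print(restored == data)  # True: round-trips byte-for-byte
```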
If you want to explicitly steer away from any issues with encoding and just see files as bunches of binary data (not strings of text in a specific encoding) you might want to use your database's binary format.
For MySQL this is BINARY and VARBINARY: http://dev.mysql.com/doc/refman/5.0/en/binary-varbinary.html
For a deeper understanding of unicode & utf-8 issues (recommended) this is a nice read on the subject:
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF

How to handle unicode of an unknown encoding in Django?

I want to save some text to the database using the Django ORM wrappers. The problem is, this text is generated by scraping external websites and many times it seems they are listed with the wrong encoding. I would like to store the raw bytes so I can improve my encoding detection as time goes on without redoing the scrapes. But Django seems to want everything to be stored as unicode. Can I get around that somehow?
You can store the data encoded into base64, for example. Or try to analyze the HTTP headers from the browser; maybe it is simpler to get the proper encoding from there.
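The header suggestion can be sketched with the stdlib email machinery, which already knows how to parse Content-Type parameters (the header values here are just examples):

```python
from email.message import Message

def charset_from_content_type(header_value, default="utf-8"):
    # Parses e.g. "text/html; charset=windows-1255", handling quoting and
    # parameter order for us; falls back to a default when absent.
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default

print(charset_from_content_type("text/html; charset=windows-1255"))  # windows-1255
print(charset_from_content_type("text/plain"))                       # utf-8
```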
Create a File with the data. Use a Django models.FileField to hold a reference to the file.
No, it does not involve a ton of I/O: if your file is small it adds 2 or 3 I/Os (the directory read, the inode read, and the data read).
