Python 2.7, Appengine Data Store & Unicode

So I've been reading quite a bit about Unicoding tonight because I was thinking of switching to Jinja2, which requires Unicode to be used everywhere within the app. I think I have a good idea of how to deal with it, but I wanted to hear if this is reasonable before I started to code my app:
Dealing with External Text-Inputs (via html forms)
a) Make sure all html pages are utf-8 encoded.
b) Once users press submit, make sure the data is converted into Unicode as soon as the Python backend receives it, e.g. self.request.get('stuff').decode('utf-8')
c) Stay in Unicode and pass the output to Jinja2, which will always encode it using the default encoding of utf-8.
Information from the appengine datastore
Because google stores everything as Unicode, all data coming in from the datastore is already unicode and I don't have to worry about anything (yay!)
Strings within the app
Make sure all "" start with a u (i.e. u"hello world"), this will force everything to be in unicode.
Well the above is my strategy to keep everything consistent. Is there anything else I need to account for?
thanks!

You should not need to call .decode('utf-8') on self.request.get('stuff') if you are using webapp or webapp2. The framework respects the input type of the data as specified.
Everything else looks right.
Also I believe that
from __future__ import unicode_strings
should be
from __future__ import unicode_literals
and is only available in 2.6 and 2.7. So in App Engine it would only be available if you are using 2.7.
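The decode-once, encode-once strategy discussed in this thread can be sketched roughly as follows. This is only an illustration: to_text and to_bytes are hypothetical helper names, the sample bytes are made up, and explicit u"" literals are used so the snippet behaves the same on Python 2.7 and Python 3.

```python
def to_text(value, encoding="utf-8"):
    """Return text (unicode); decode only if we were handed raw bytes."""
    if isinstance(value, bytes):  # on Python 2, bytes is an alias of str
        return value.decode(encoding)
    return value


def to_bytes(text, encoding="utf-8"):
    """Encode text once, at the point where it leaves the application."""
    return text.encode(encoding)


raw = b"caf\xc3\xa9"             # made-up UTF-8 bytes arriving from outside
name = to_text(raw)              # decode once at the boundary
greeting = u"hello " + name      # unicode literal + unicode value
payload = to_bytes(greeting)     # encode once on the way out
```

Passing already-decoded text through to_text is harmless, which is convenient when a framework has done the decoding for you.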

Related

GAE unicode character gets encoded to utf-8 bytes

In my app I'm accepting text from user inputs where users often paste text from microsoft word.
A good example is the apostrophe ’, which for some reason gets converted to =E2=80=99 when posting to my handler in Google App Engine. I've tried a number of confused ways to prevent this, and I'm quite happy to simply remove these characters; some of these methods work in plain Python but not in App Engine.
here's some of what I've tried:
problem_string = re.sub(r'[^\x00-\x7F]+','', problem_string)# trying to remove it
problem_string = problem_string.encode( "utf-8" )# desperation...
problem_string = "".join((c if ord(c) < 128 else '' for c in problem_string))# trying to just remove the thing
problem_string = unicode(problem_string, "utf8")# probably fails since its already unicode
... where I'm trying to capture the string including ’ and then later save it to the ndb datastore as a StringProperty(). Except for the last option, the apostrophe example gets converted to =E2=80=99.
If I could save the apostrophe type character and display it again that would be great, but simply removing it would also serve my needs.
*Edit - the following:
experience = re.sub(r'[^\x00-\x7F]+',' ', experience)
seems to work fine on the dev server, and successfully removes the offending apostrophe.
Also, what may be an issue is that the POST fields are going through the blobstore (blobstore_handlers.BlobstoreUploadHandler), which I think may be causing some problems.
I've really been bumping my head against this and I would really really appreciate an explanation from some clever stack-overflower...

Ok, I think I've vaguely stumbled upon a solution.
It had something to do with the blobstore upload handler, I guess it was encoding/decoding unicode appropriately to account for weird file characters. So I modified the handler so that the image file is uploaded via google cloud storage instead of the blobstore and it seems to work fine, i.e. the ’ gets to the datastore as ’ instead of =E2=80=99
I won't accept my own answer for the next few days, maybe someone can clarify things better for future confused individuals.
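For future readers: the =E2=80=99 form matches quoted-printable encoding of the UTF-8 bytes E2 80 99, which are the encoding of ’ (U+2019). Whether the upload handler is really the component producing it is not confirmed in this thread, so the following is only a sketch of how such a value could be turned back into text with the standard library; decode_quoted_printable is a hypothetical helper name.

```python
import quopri


def decode_quoted_printable(value, encoding="utf-8"):
    """Undo quoted-printable escaping, then decode the resulting bytes to text."""
    if not isinstance(value, bytes):
        value = value.encode("ascii")  # quoted-printable data is pure ASCII
    return quopri.decodestring(value).decode(encoding)


fixed = decode_quoted_printable("It=E2=80=99s")  # made-up sample input
```

Note that a literal "=" followed by two hex digits in genuine user text would also be decoded, so this should only be applied to fields known to arrive in this form.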

Importing from Oracle using the correct encoding with Python

I apologize for making a character encoding question since I know you folk get many everyday, but I couldn't figure out my problem so I asked anyway.
Here is what we are doing:
Take Data from an Oracle DB using Python and cx_Oracle.
Write the data to a file using Python.
Ingest the file into Postgres using Python and psycopg2.
Here are the important Oracle settings:
SQL> select * from NLS_DATABASE_PARAMETERS;
PARAMETER VALUE
------------------------------ ----------------------------------------
NLS_LANGUAGE AMERICAN
NLS_TERRITORY AMERICA
NLS_CURRENCY $
NLS_ISO_CURRENCY AMERICA
NLS_NUMERIC_CHARACTERS .,
NLS_CHARACTERSET US7ASCII
According to this NLS_LANG faq, you are meant to set the NLS_LANG according to what your client OS is using.
Running locale gives us: LANG=en_US.UTF-8 (all of the other fields were also en_US.UTF-8).
So, in our Python script, we set it like this:
os.environ["NLS_LANG"] = "AMERICAN_AMERICA.AL32UTF8"
Then we import the data and write it to a file.
row = cur.fetchall()
fil.write(row[0][0]) #For this test, I am only writing one row and one field.
We ingest that file into our UTF-8 Postgres DB.
Unfortunately, for some reason, we get this symbol: � in our file and the subsequent PG table as well. If my understanding is correct, this is the Replace Character. I believe that character is meant to show up if Unicode does not recognize a symbol.
(In some text editors, the symbol shows up as ï¿½).
What I don't understand is why is this happening? I thought UTF-8 was backwards compatible with 7-bit ASCII?
And even if we are using regional pages, shouldn't it still work, since the client is using US and the Oracle server is using AMERICAN?
How can I check if the data is imported correctly and if it isn't correct, how can I fix it so future imports are?
Note: The Oracle field is a CHAR field and not a NCHAR field.
Note2: We are using Python 2.4, so we don't have the native Unicode stuff in Python 3.X. So, it is possible that Python is messing up somewhere though I thought cx_Oracle took care of it all.
Thanks for your time, I hope you have a good day.

Unfortunately, for some reason, we get this symbol: � in our file and the subsequent PG table as well. If my understanding is correct, this is the Replace Character. I believe that character is meant to show up if Unicode does not recognize a symbol.
Mostly right but not quite. PostgreSQL will refuse to insert non-UTF8 text characters when using that encoding (do a search on StackOverflow for "Invalid UTF8 postgresql"). Most likely the character you are seeing is a valid UTF8 character that is not recognized by your font and therefore is showing the replacement character. If the symbol is in your Oracle db and is actually the replacement symbol there, then what do you want to replace it with? If that is the case, the information is already missing.
What I don't understand is why is this happening? I thought UTF-8 was backwards compatible with 7-bit ASCII?
It is.
How can I check if the data is imported correctly and if it isn't correct, how can I fix it so future imports are?
Most likely your problem is upstream of the Oracle db. I would find out what is actually inserting problem data into the Oracle db and fix it there. If you can check the data in Pg against the data in Oracle, you should be able to determine if the data is character for character the same (and flag any differences). That's how to check your current import.
Note2: We are using Python 2.4, so we don't have the native Unicode stuff in Python 3.X. So, it is possible that Python is messing up somewhere though I thought cx_Oracle took care of it all.
That's another possibility. Personally, for file transformations I prefer Perl because of its integrated regular expressions and absolutely top-rate PostgreSQL support. However, I recognize your import routine may not be readily convertible at this point. I am a little more familiar with troubleshooting UTF8 conversion issues in Perl than in Python. I do wonder, however, if you can check the data that is coming out in binary format for such symbols.
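Checking the exported data at the byte level, as suggested above, could look something like this. It is a sketch only: find_non_ascii is a hypothetical helper and sample is made-up data standing in for the bytes read from the export file in binary mode. It also shows why some editors display ï¿½: those are the three UTF-8 bytes of U+FFFD (the replacement character) interpreted as Latin-1.

```python
def find_non_ascii(data):
    """Return (offset, byte value) pairs for every byte above 0x7F."""
    return [(i, b) for i, b in enumerate(bytearray(data)) if b > 0x7F]


# UTF-8 encoding of U+FFFD, the replacement character: EF BF BD
REPLACEMENT_UTF8 = u"\ufffd".encode("utf-8")

sample = b"abc" + REPLACEMENT_UTF8 + b"def"      # made-up file contents
hits = find_non_ascii(sample)                    # where the non-ASCII bytes sit
has_replacement = REPLACEMENT_UTF8 in sample     # True if U+FFFD is stored literally
mojibake = REPLACEMENT_UTF8.decode("latin-1")    # what a Latin-1 viewer would show
```

If the file literally contains EF BF BD, the original character was already lost before the file was written, which points at the Oracle client conversion rather than at PostgreSQL.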

Strange behaviour with BeautifulSoup and converting HTML entities

I have a strange problem with converting special characters from HTML. I have a Django project where text is stored HTML-encoded in a MySQL database. This is necessary, because I don't want to lose any formatting of the text.
In a preliminary step I must do operational things on the text like calculating positions, so I need to convert it first and clear it from all HTML-Tags. This is done by BeautifulSoup:
convertedText = str(BeautifulSoup(text.text, convertEntities=BeautifulSoup.HTML_ENTITIES))
convertedText = ''.join(BeautifulSoup(convertedText).findAll(text=True))
By working on my Django-default test-server everything works fine, but when I run it on my production server there are strange behaviors when converting special characters.
An example:
Test server
MySQL-Query gives me: <p>bassverstärker</p>
is correctly converted to: bassverstärker
Production server
MySQL-Query gives me: <p>bassverstärker</p>
This is wrongly converted to: bassverst\ucc44rker
Somehow the ä is converted into \ucc44 and this results in a wrong character.
My configuration:
Test server:
Django built-in solution (python manage.py runserver)
BeautifulSoup 3.2.1
Python 2.6.5
Ubuntu 2.6.32-43-generic
Production server:
Cherokee 1.2.101
BeautifulSoup 3.2.1
python 2.7.3
Ubuntu 3.2.0-32-generic
Because I don't know at which level the error occurs, I would like to ask if anybody can help me with this. Many thanks in advance.

I found a way to fix this. I didn't know that BeautifulSoup has the builtin method getText(). When converting HTML through:
convertedText = BeautifulSoup(text.text, convertEntities=BeautifulSoup.HTML_ENTITIES).getText()
everything works fine on both servers. Although this works, it would be interesting to know why both servers are behaving differently when working with the example in the question.
However, thanks to all.
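For comparison, the same "convert entities and drop tags" step can be sketched with the standard library alone. This is not the BeautifulSoup 3 approach used in this thread but a swapped-in technique, and it assumes Python 3 (html.parser); TextExtractor and html_to_text are hypothetical names, and the input string is an assumed example with the umlaut written as an entity.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes; entities are converted because convert_charrefs is on."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(markup):
    """Return the text content of an HTML fragment as a unicode string."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


plain = html_to_text("<p>bassverst&auml;rker</p>")
```

Because the result never passes through str() as bytes, there is no implicit encode/decode round trip between the two steps.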

Django Localization, HTML translation string are working but Not translating Python translation strings when switching language

I went through many posts but was not able to resolve the problem; maybe the problem is something else.
The application uses Django and App Engine.
When I select a language (e.g. "Spanish (es)"), everything works perfectly fine, even the Python translation strings.
But when I switch to some other language (e.g. "Japanese (ja)"), the HTML works but some Python translations still use "Spanish (es)" (the previous language).
In middleware classes, I am setting:
1. request.LANGUAGE_CODE
2. request.session['django_language']
3. settings.LANGUAGE_CODE (may be not required, but still updating)
4. request.COOKIE['django_language']
5. translation.activate('<lang>')
And in processing response, I am:
1. translation.deactivate()
2. translation.deactivate_all()
I am not sure what exactly the problem is.
But I guess that when the application initially loads, it configures itself from the instructions in settings.py, and whatever Python scripts load at that time have their translations fixed.
I use custom AUTH_USER_MODULE and AUTH_ADMIN_MODULE instead of the Django-defined ones.
Any idea what I am doing wrong?
I much appreciate your help.
Let me know if you need more information on this.
Thanks
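If the guess above is right, and some strings are translated once when a module is first imported, the effect can be illustrated without Django. The toy below is an assumption-laden sketch, not Django's implementation: CATALOGS, activate, translate and LazyText are all hypothetical names. Django itself offers ugettext_lazy in django.utils.translation for deferring the lookup in module-level strings.

```python
CATALOGS = {
    "es": {"Hello": "Hola"},
    "ja": {"Hello": "\u3053\u3093\u306b\u3061\u306f"},
}
_active = {"lang": "es"}


def activate(lang):
    """Switch the language used by later translate() calls."""
    _active["lang"] = lang


def translate(message):
    """Look the message up in the currently active catalog."""
    return CATALOGS.get(_active["lang"], {}).get(message, message)


class LazyText(object):
    """Defer the catalog lookup until the text is actually rendered."""

    def __init__(self, message):
        self.message = message

    def __str__(self):
        return translate(self.message)


activate("es")
EAGER = translate("Hello")  # evaluated once, like a module-level ugettext() call
LAZY = LazyText("Hello")    # evaluated each time it is turned into a string

activate("ja")
eager_after_switch = EAGER
lazy_after_switch = str(LAZY)
```

After the switch, the eager value is still Spanish while the lazy one follows the active language, which matches the symptom described in the question.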

How do I create a web interface to a simple python script?

I am learning python. I have created some scripts that I use to parse various websites that I run daily (as their stats are updated), and look at the output in the Python interpreter. I would like to create a website to display the results. What I want to do is run my script when I go to the site, and display a sortable table of the results.
I have looked at Django and am part way through the tutorial, but it seems like an awful lot of overhead for what should be a simple problem. I know that I could just write a Python script to output simple HTML, but is that really the best way? I would like to be able to sort the table by various columns.
I have years of programming experience (C, Java, etc.), but have very little web development experience.
Thanks in advance.

Have you considered Flask? Like Tornado, it is both a "micro-framework" and a simple web server, so it has everything you need right out of the box. http://flask.pocoo.org/
This example (right off the homepage) pretty much sums up how simple the code can be:
from flask import Flask
app = Flask(__name__)
# Register hello() as the handler for the site root.
@app.route("/")
def hello():
    return "Hello World!"
if __name__ == "__main__":
    app.run()

If you are creating non-interactive pages, you can easily set up any modern web server to execute your Python script as a CGI. Instead of loading a static file, your web server will return the output of your Python script.
This isn't very sophisticated, but if you are simply returning the output without needing browser-submitted data, this is the easiest way (scaling under load is a different story).
You don't even need the "cgi" module from python, if you aren't receiving any data from the browser. Anything more complicated than this and you should use a web framework.
Examples and other methods
Simple Example: hardest part is webserver configuration
mod_python: Cut down on CGI overhead (otherwise, apache execs the python interpreter for each hit)
python module cgi: sending data to your python script from the browser.
Sorting
Javascript side sorting: I've used this javascript library to add sortable tables. This is the easiest way to add sorting without requiring additional work or another HTTP GET.
Instructions:
Download this file
Add to your HTML
Add class="sortable" to any table you'd like to make sortable
Click on the headers to sort
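A CGI script of the kind described in this answer only has to write an HTTP header block, a blank line, and the HTML. The sketch below assumes Python 3; render_table and cgi_response are hypothetical helpers and the rows are made-up data. It tags the table with class="sortable" as in the instructions above, and the script tag for the sorting library is left out because its file name is not given here.

```python
from html import escape


def render_table(headers, rows):
    """Build an HTML table, escaping every cell so scraped text cannot break the markup."""
    head = "".join("<th>%s</th>" % escape(str(h)) for h in headers)
    body = "".join(
        "<tr>%s</tr>" % "".join("<td>%s</td>" % escape(str(c)) for c in row)
        for row in rows
    )
    return (
        '<table class="sortable"><thead><tr>%s</tr></thead>'
        "<tbody>%s</tbody></table>" % (head, body)
    )


def cgi_response(html_body):
    """Prefix the body with the header block and blank line a CGI response needs."""
    return "Content-Type: text/html; charset=utf-8\r\n\r\n" + html_body


# A real CGI script would print this string to stdout.
page = cgi_response(render_table(["site", "hits"], [["a<b", 3], ["c", 10]]))
```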

You might consider Tornado if Django is too much overhead. I've used both and agree that, if you have something simple/small to do and don't already know Django, it's going to exponentially increase your time to production. On the other hand, you can 'get' Tornado in a couple of hours and get something relatively simple done in a day or two with no prior experience with it. At least, that's been my experience with it.
Note that Tornado is still a tradeoff: you get a lot of simplicity in exchange for the huge cornucopia of features and shortcuts you get w/ Django.
PS - in addition to being a 'micro-framework', Tornado is also its own web server, so there's no mucking with wsgi/mod-cgi/fcgi.... just write your request handlers and run it. Be sure to see the demos included in the distribution.

Have you seen bottle framework? It is a micro framework and very simple.

If I correctly understood your requirements you might find Wooey very interesting.
Wooey is a Django app that creates automatic web UIs for Python scripts:
http://wooey.readthedocs.org
Here you can check a demo:
https://wooey.herokuapp.com/

Django is a big web framework, meant to include loads of things because you often need them, even though sometimes you don't.
Look at Pyramid, earlier known as BFG. It's much smaller.
http://pypi.python.org/pypi/pyramid/1.0a1
Other microframeworks to check out are here: http://wiki.python.org/moin/WebFrameworks
On the other hand, in this case it's probably also overkill. It sounds like you can run the script once every ten minutes, write a static HTML file, and just use Apache.

If you are not willing to write your own tool, there is a pretty advanced tool for executing your scripts: http://rundeck.org/
It's pretty simple to start and can be configured for complex scenarios as well.
For the requirement of custom view (with sortable results), I believe you can implement a simple plugin for translating script output into html elements.
Also, for simple setups I could recommend my own tool: https://github.com/bugy/script-server. It doesn't have tons of features, but very easy for end-users and supports interactive execution.

If you don't need any input from the browser, this sounds like an almost-static webpage that just happens to change once a day. You'll only need some way to get HTML out of your script, in a place where your web server can access it.
So you'd use some form of templating; if you need some structure above the single page, there are static site / blog generators that you can feed your output to in, say, Markdown format, and then call their make html or the like.
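As a rough sketch of the "almost-static page" idea, a scheduled script could fill a template and write the result where the web server can serve it. The example assumes Python 3 and uses string.Template from the standard library in place of a full templating engine; PAGE, write_static_page, and the output path are hypothetical.

```python
import os
import tempfile
from string import Template

PAGE = Template("<html><body><h1>$title</h1>\n$body\n</body></html>\n")


def write_static_page(path, title, body_html):
    """Render the template and swap the file into place in one step."""
    html = PAGE.substitute(title=title, body=body_html)
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first so readers never see a half-written page.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(html)
    os.replace(tmp_path, path)
    return html


# Made-up destination; in practice this would be inside the server's document root.
out_path = os.path.join(tempfile.mkdtemp(), "stats.html")
written = write_static_page(out_path, "Daily stats", "<p>42 pages parsed</p>")
```

Running this from cron once a day keeps the web server itself completely static.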

You can use DicksonUI https://dicksonui.gitbook.io
DicksonUI is better
Or Remi gui(search in google)
DicksonUI is better.
I am the author of DicksonUI
