Using utf-8 characters in a Jinja2 template - python

I'm trying to use UTF-8 characters when rendering a template with Jinja2. Here is what my template looks like:
<!DOCTYPE HTML>
<html manifest="" lang="en-US">
<head>
<meta charset="UTF-8">
<title>{{title}}</title>
...
The title variable is set something like this:
index_variables = {'title':''}
index_variables['title'] = myvar.encode("utf8")
template = env.get_template('index.html')
index_file = open(preview_root + "/" + "index.html", "w")
index_file.write(
template.render(index_variables)
)
index_file.close()
Now, the problem is that myvar is a message read from a message queue and can contain those special utf8 characters (ex. "Séptimo Cine").
The rendered template looks something like:
...
<title>S\u00e9ptimo Cine</title>
...
and I want it to be:
...
<title>Séptimo Cine</title>
...
I have made several tests but I can't get this to work.
I have tried to set the title variable without .encode("utf8"), but it throws an exception (ValueError: Expected a bytes object, not a unicode object), so my guess is that the initial message is unicode
I have used chardet.detect to get the encoding of the message (it's "ascii"), then did the following: myvar.decode("ascii").encode("cp852"), but the title is still not rendered correctly.
I also made sure that my template is a UTF-8 file, but it didn't make a difference.
Any ideas on how to do this?

TL;DR:
Pass Unicode to template.render()
Encode the rendered unicode result to a bytestring before writing it to a file
This had me puzzled for a while. Because you do
index_file.write(
template.render(index_variables)
)
in one statement, that's basically just one line where Python is concerned, so the traceback you get is misleading: The exception I got when recreating your test case didn't happen in template.render(index_variables), but in index_file.write() instead. So splitting the code up like this
output = template.render(index_variables)
index_file.write(output)
was the first step to diagnose where exactly the UnicodeEncodeError happens.
Jinja returns unicode when you let it render the template. Therefore you need to encode the result to a bytestring before you can write it to a file:
index_file.write(output.encode('utf-8'))
The second error is that you pass a UTF-8 encoded bytestring to template.render(), but Jinja wants unicode. So assuming your myvar contains UTF-8, you need to decode it to unicode first:
index_variables['title'] = myvar.decode('utf-8')
So, to put it all together, this works for me:
# -*- coding: utf-8 -*-
from jinja2 import Environment, PackageLoader
env = Environment(loader=PackageLoader('myproject', 'templates'))
# Make sure we start with an utf-8 encoded bytestring
myvar = 'Séptimo Cine'
index_variables = {'title':''}
# Decode the UTF-8 string to get unicode
index_variables['title'] = myvar.decode('utf-8')
template = env.get_template('index.html')
with open("index_file.html", "wb") as index_file:
    output = template.render(index_variables)
    # jinja returns unicode - so `output` needs to be encoded to a bytestring
    # before writing it to a file
    index_file.write(output.encode('utf-8'))
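
The same decode-on-input, encode-on-output pattern can be sketched with the standard library alone. This is a minimal illustration written for Python 3, not the code from the question: the string formatting merely stands in for template.render(), the incoming bytes are assumed to be UTF-8, and the names raw_message and out_path are made up for the example.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Bytes as they might arrive from the queue (assumed to be UTF-8 here).
raw_message = u"Séptimo Cine".encode("utf-8")

# Decode once at the input boundary so everything downstream is text.
title = raw_message.decode("utf-8")

# Stand-in for template.render(); Jinja2 itself is not used in this sketch.
rendered = u"<title>{0}</title>".format(title)

# io.open encodes on write, so no manual .encode() call is needed.
out_path = os.path.join(tempfile.mkdtemp(), "index.html")
with io.open(out_path, "w", encoding="utf-8") as index_file:
    index_file.write(rendered)

# Read the raw bytes back to check what actually landed on disk.
with io.open(out_path, "rb") as check:
    written_bytes = check.read()
```

Opening the file with an explicit encoding keeps the conversion in one place, which makes it harder to encode twice or forget to encode at all.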

Try changing your render command to this...
template.render(index_variables).encode( "utf-8" )
Jinja2's documentation says "This will return the rendered template as unicode string."
http://jinja.pocoo.org/docs/api/?highlight=render#jinja2.Template.render
Hope this helps!

Add the following lines to the beginning of your script and it will work fine without any further changes:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

Related

Twisted Web: problems encoding unicode

This is probably a stupid question/problem, but I could not find an answer for it. Also, it may not really be Twisted-specific.
I am trying to write a resource for a twisted.web webserver, which should serve a page containing non-ASCII characters.
According to this discussion, all I need to do is to set the Content-Type HTTP header and return an encoded string.
Unfortunately, the page shows invalid characters.
Here is the code (as a .rpy):
"""a unicode test"""
from twisted.web.resource import Resource
class UnicodeTestResource(Resource):
    """A unicode test resource."""
    isLeaf = True
    encoding = "utf-8"

    def render_GET(self, request):
        text = u"unicode test\n ä ö ü ß"
        raw = u"<HTML><HEAD><TITLE>Unicode encoding test</TITLE><HEAD><BODY><P>{t}</P></BODY></HTML>".format(t=text)
        enc = raw.encode(self.encoding)
        request.setHeader("Content-Type", "text/html; charset=" + self.encoding)
        return enc
resource = UnicodeTestResource()
The result (without the html) is: unicode test ä ö ü Ã.
Is this caused by an encoding mismatch between the server and the client?
I am using python 2.7.12 and twisted 17.1.0. The page has been accessed using firefox.
Sorry for my terrible english.
Thanks
EDIT: I found the problem. I assumed that twisted.web.static.File with a ResourceScript processor would use the encoding specified in the file in which the reactor is running.
Apparently this is not the case.
Adding # -*- coding: utf-8 -*- on the top of each file fixed the problem.
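
One plausible mechanism for this kind of output, offered as a sketch rather than a confirmed description of what Twisted does: the UTF-8 bytes of the source file get read as Latin-1, and the resulting characters are then encoded to UTF-8 a second time. The snippet is written for Python 3 and all variable names are illustrative.

```python
# -*- coding: utf-8 -*-
text = u"ä ö ü ß"

# How the characters are stored in a UTF-8 source file.
source_bytes = text.encode("utf-8")

# Assumption: without a coding declaration, the bytes are read as Latin-1.
misread = source_bytes.decode("latin-1")

# The resource then encodes the already-wrong text as UTF-8...
served = misread.encode("utf-8")

# ...and the browser faithfully decodes it, showing two characters per letter.
displayed = served.decode("utf-8")

# Reversing the wrong step recovers the original text.
recovered = displayed.encode("latin-1").decode("utf-8")
```

Each accented letter turns into a pair of characters beginning with "Ã", which matches the pattern in the result quoted above.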

Spynner wrong encoding

I'm trying to download this page - https://itunes.apple.com/ru/app/farm-story/id367107953?mt=8 (looks like this for me in Russia - http://screencloud.net/v/6a7o) via spynner in python - it uses some javascript checking so one does not simply download it without full browser emulation.
My code:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from StringIO import StringIO
import spynner
def log(str, filename_end):
    filename = '/tmp/apple_log_%s.html' % filename_end
    print 'logged to %s' % filename
    f = open(filename, 'w')
    f.write(str)
    f.close()
debug_stream = StringIO()
browser = spynner.Browser(debug_level=3, debug_stream=debug_stream)
browser.load("https://itunes.apple.com/ru/app/farm-story/id367107953?mt=8")
ret = browser.contents
log(ret, 'noenc')
print 'content length = %s' % len(ret)
browser.close()
del browser
f=open('/tmp/apple_log_debug', 'w')
f.write(debug_stream.getvalue())
f.close()
print 'log stored in /tmp/apple_log_debug'
So, the problem is: either apple, either spynner work wrong with Cyrillic symbols. I see them fine if I try browser.show() after loading, but in the code and logs they are still wrong encoded like <meta content="ÐолÑÑиÑÑ Farm Story⢠в App Store. ÐÑоÑмоÑÑеÑÑ ÑкÑинÑоÑÑ Ð¸ ÑейÑинги, пÑоÑиÑаÑÑ Ð¾ÑзÑÐ²Ñ Ð¿Ð¾ÐºÑпаÑелей." property="og:description">.
http://2cyr.com/ Says that it is a utf-8 text displayed like iso-8859-1...
As you see - I don't use any headers in my request, but if I take them from chrome's network debug console and pass it to load() method e.g. headers=[('Accept-Encoding', 'utf-8'), ('Accept-Language', 'ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4')] - I get the same result.
Also, from the same network console you can see that chrome uses gzip,deflate,sdch as Accept-Encoding. I can try that too, but I fail to decode what I get: <html><head></head><body>��}ksÇ�g!���4�I/z�O���/)�(yw���é®i��{�<v���:��ٷ�س-?�b�b�� j�... even if I remove the tags from the begin and end of the result.
Any help?
Basically, browser.webframe.toHtml() returns a QString, in which case str() won't help if the result actually contains non-Latin Unicode characters.
If you want to get a Python unicode string you need to do:
ret = unicode(browser.webframe.toHtml().toUtf8(), encoding="UTF-8")
#if you want to get rid of non-latin text
ret = ret.encode("ascii", errors="replace") # encodes to bytestring
If you suspect the text is Russian, you could encode it to a Russian single-byte code page (still a bytestring) by doing
ret = ret.encode("cp1251", errors="replace") # encodes to Win-1251
# or
ret = ret.encode("cp866", errors="replace") # encodes to windows/dos console
Only then you can save it to an ASCII file.
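
The "UTF-8 text displayed as ISO-8859-1" diagnosis from the question can be reproduced and reversed with plain string methods. This is a small Python 3 sketch that does not use spynner; the sample text is taken from the page's description, and the variable names are invented for the example.

```python
# -*- coding: utf-8 -*-
original = u"Получить Farm Story™ в App Store"

# UTF-8 bytes interpreted as ISO-8859-1 produce the garbled form.
garbled = original.encode("utf-8").decode("iso-8859-1")

# Undo the wrong decode, then apply the right one.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

# Encoding to a narrower charset; unencodable characters become "?".
ascii_bytes = repaired.encode("ascii", "replace")
cp1251_bytes = repaired.encode("cp1251", "replace")
```

The round trip only works because ISO-8859-1 maps every byte value to a character, so no information is lost by the wrong decode.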
str(browser.webframe.toHtml()) saved me

Regex on unicode string

I am trying to download a few hundred Korean pages like this one:
http://homeplusexpress.com/store/store_view.asp?cd_express=3
For each page, I want to use a regex to extract the "address" field, which in the above page looks like:
*주소 : 서울시 광진구 구의1동 236-53
So I do this:
>>> import requests
>>> resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
>>> resp.encoding
'ISO-8859-1'
>>> # I wonder why it's ISO-8859-1, since I thought that is for Latin text (Latin-1).
>>> html = resp.text
>>> type(html)
<type 'unicode'>
>>> html
(outputs a long string that contains a lot of characters like \xc3\xb7\xaf\xbd\xba \xc0\xcd\xbd\xba\xc7\xc1\xb7\xb9\)
I then wrote a script. I set # -*- coding: utf-8 -*- on the .py file and put this:
address = re.search('주소', html)
However, re.search is returning None. I tried with and without the u prefix on the regex string.
Usually I can solve issues like this with a call to .encode or .decode but I tried a few things and am stuck. Any pointers on what I'm missing?
According to the tag in the html document header:
<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
the web page uses the euc-kr encoding.
I wrote this code:
# -*- coding: euc-kr -*-
import re
import requests
resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
html = resp.text
address = re.search('주소', html)
print address
Then I saved it in gedit using the euc-kr encoding.
I got a match.
But actually there is an even better solution! You can keep the utf-8 encoding for your files.
# -*- coding: utf-8 -*-
import re
import requests
resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
resp.encoding = 'euc-kr'
# we need to specify what the encoding is because the
# requests library couldn't detect it correctly
html = resp.text
# now the html variable contains a unicode instance decoded from euc-kr
print type(html)
# we use the re.search functions with unicode strings
address = re.search(u'주소', html)
print address
From the requests documentation: "When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers."
If you check the response headers for this website, you can see that the server does not declare an encoding.
I think the only option in this case is directly specify what encoding to use:
# -*- coding: utf-8 -*-
import requests
import re
r = requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
r.encoding = 'euc-kr'
print re.search(ur'주소', r.text, re.UNICODE)
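
Why the original search returned None can be shown offline. The sketch below (Python 3, no network access) builds a few EUC-KR bytes by hand to stand in for the downloaded page and decodes them both ways; page_bytes and the other names are made up for the illustration.

```python
# -*- coding: utf-8 -*-
import re

# Stand-in for the raw response body: Korean text encoded as EUC-KR.
page_bytes = u"*주소 : 서울시 광진구".encode("euc-kr")

# What resp.text amounts to when the encoding is guessed as ISO-8859-1.
as_latin1 = page_bytes.decode("iso-8859-1")

# What resp.text amounts to after setting resp.encoding = 'euc-kr'.
as_euckr = page_bytes.decode("euc-kr")

# The Hangul pattern cannot match text made only of Latin-1 characters.
miss = re.search(u"주소", as_latin1)
hit = re.search(u"주소", as_euckr)
```

Here miss is None and hit is a match object, which mirrors the behaviour before and after overriding the encoding.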

Python flask flash message exception remains after restarting

I'm making a small flask app where I had something like this:
@app.route('/bye')
def logout():
    session.pop('logged_in', None)
    flash('Adiós')
    return redirect('/index')
Needless to say, when I ran the application and navigated to '/bye' it gave me a UnicodeDecodeError. Well, now it gives me the same UnicodeDecodeError on every page that extends the base template (which renders the messages), even after restarting the application, and always with the same dump() despite removing that flash from the source code. All I can think of is what the crap? Help please.
Well I had to restart my computer to clear the stupid session cache or something.
I think that flash() actually creates a session called session['_flashes']. See this code here. So you will probably have to either:
clear/delete the cookie
OR
session.pop('_flashes', None)
Flask flashing stores the messages in a session cookie until they are successfully "consumed".
If you get a UnicodeDecodeError (https://wiki.python.org/moin/UnicodeDecodeError), the messages are not consumed, so you get the error again and again.
My solution was to delete the cookie from the browser
Since I had the problem when using localization, I solved the cause now by installing my translation object like:
trans = gettext.GNUTranslations(...)
trans.install(unicode=True)
and having UTF-8 encoding in my python source files and "Content-Type: text/plain; charset=UTF-8\n" in the translation file (.pot)
You're using a non-ASCII string, "Adiós", so you need to ensure that Python will process strings as unicode, not as ASCII.
Add this to the header of your Python file. It tells the interpreter that your file contains UTF-8 strings:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
so your code will be something like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
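from flask import Flask, flash, redirect, session
app = Flask(__name__)

@app.route('/bye')
def logout():
    session.pop('logged_in', None)
    # the u prefix makes this a unicode string rather than a bytestring
    flash(u'Adiós')
    return redirect('/index')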
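
The underlying failure can be reproduced without Flask. In Python 2, mixing a bytestring with unicode triggers an implicit ASCII decode; the sketch below uses Python 3 syntax, where that decode has to be spelled out, so the error is easy to see. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# The bytes a UTF-8 source file holds for the flashed message.
raw = u"Adiós".encode("utf-8")

# Decoding as ASCII fails on the first byte of the accented letter.
error = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the encoding the bytes were written in succeeds.
message = raw.decode("utf-8")
```

Keeping the message as unicode from the start avoids the implicit decode entirely.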

Wrong encoding of text, in Django?

"query" = джазовыми
For some reason...when I display it via:
{{ query|safe }}
I get this:
%u0434%u0436%u0430%u0437%u043E%u0432%u044B%u043C%u0438
If the query were set in the source, this would solve it:
query = u"джазовыми"
(provided that for example your file encoding is utf-8 and you have corresponding line
# -*- coding: UTF-8 -*-
in the beginning)
But I guess the query is entered by user. The error seems to be located in that part of your code. Can you quote how it is done?
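
The %uXXXX form looks like the non-standard escaping produced by JavaScript's escape() function, though the question does not say where the value comes from, so treat that as an assumption. If the value does arrive in that form, a small helper can turn it back into text. This is a Python 3 sketch and decode_percent_u is a hypothetical name, not a Django function.

```python
# -*- coding: utf-8 -*-
import re


def decode_percent_u(value):
    """Replace each %uXXXX escape with the character it stands for."""
    return re.sub(
        r"%u([0-9A-Fa-f]{4})",
        lambda match: chr(int(match.group(1), 16)),
        value,
    )


escaped = "%u0434%u0436%u0430%u0437%u043E%u0432%u044B%u043C%u0438"
query = decode_percent_u(escaped)
```

Fixing the client so that it sends standard percent-encoded UTF-8 (for example with encodeURIComponent) would remove the need for such a helper.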
