Twisted Web: problems encoding unicode - python

This is probably a stupid question/problem, but i could not find an answer for it. Also, it may not realy be twisted specific.
I am trying to write a resource for a twisted.web webserver, which should serve a page containing non-ascii characters.
According to this discusson, all i need to do is to set the Content-Type HTTP-Header and return an encoded string.
Unfortunately, the page shows invalid characters.
Here is the code (as a .rpy):
"""a unicode test"""
from twisted.web.resource import Resource
class UnicodeTestResource(Resource):
"""A unicode test resource."""
isLeaf = True
encoding = "utf-8"
def render_GET(self, request):
text = u"unicode test\n ä ö ü ß"
raw = u"<HTML><HEAD><TITLE>Unicode encoding test</TITLE><HEAD><BODY><P>{t}</P></BODY></HTML>".format(t=text)
enc = raw.encode(self.encoding)
request.setHeader("Content-Type", "text/html; charset=" + self.encoding)
return enc
resource = UnicodeTestResource()
The result (without the html) is: unicode test ä ö ü Ã.
Is this caused by an encoding mismatch between the server and the client?
I am using python 2.7.12 and twisted 17.1.0. The page has been accessed using firefox.
Sorry for my terrible english.
Thanks
EDIT: I found the problem. I assumed that twisted.web.static.File with a ResourceScript processor would use the encoding specified in the file in which the reactor is running.
Apparently this is not the case.
Adding # -*- coding: utf-8 -*- on the top of each file fixed the problem.

Related

Spynner wrong encoding

I'm trying to download this page - https://itunes.apple.com/ru/app/farm-story/id367107953?mt=8 (looks like this for me in Russia - http://screencloud.net/v/6a7o) via spynner in python - it uses some javascript checking so one does not simply download it without full browser emulation.
My code:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from StringIO import StringIO
import spynner
def log(str, filename_end):
filename = '/tmp/apple_log_%s.html' % filename_end
print 'logged to %s' % filename
f = open(filename, 'w')
f.write(str)
f.close()
debug_stream = StringIO()
browser = spynner.Browser(debug_level=3, debug_stream=debug_stream)
browser.load("https://itunes.apple.com/ru/app/farm-story/id367107953?mt=8")
ret = browser.contents
log(ret, 'noenc')
print 'content length = %s' % len(ret)
browser.close()
del browser
f=open('/tmp/apple_log_debug', 'w')
f.write(debug_stream.getvalue())
f.close()
print 'log stored in /tmp/debug_log'
So, the problem is: either apple, either spynner work wrong with Cyrillic symbols. I see them fine if I try browser.show() after loading, but in the code and logs they are still wrong encoded like <meta content="ÐолÑÑиÑÑ Farm Story⢠в App Store. ÐÑоÑмоÑÑеÑÑ ÑкÑинÑоÑÑ Ð¸ ÑейÑинги, пÑоÑиÑаÑÑ Ð¾ÑзÑÐ²Ñ Ð¿Ð¾ÐºÑпаÑелей." property="og:description">.
http://2cyr.com/ Says that it is a utf-8 text displayed like iso-8859-1...
As you see - I don't use any headers in my request, but if I take them from chrome's network debug console and pass it to load() method e.g. headers=[('Accept-Encoding', 'utf-8'), ('Accept-Language', 'ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4')] - I get the same result.
Also, from the same network console you can see that chrome uses gzip,deflate,sdch as Accept-Encoding. I can try that too, but I fail to decode what I get: <html><head></head><body>��}ksÇ�g!���4�I/z�O���/)�(yw���é®i��{�<v���:��ٷ�س-?�b�b�� j�... even if I remove the tags from the begin and end of the result.
Any help?
Basically, browser.webframe.toHtml() returns a QTString in which case str() won't help if res actually has unicode non-latin characters.
If you want to get a Python unicode string you need to do:
ret = unicode(browser.webframe.toHtml().toUtf8(), encoding="UTF-8")
#if you want to get rid of non-latin text
ret = ret.encode("ascii", errors="replace") # encodes to bytestring
in case you suspect its in Russian you could decode it to a Russian multibyte oem string (sill a bytestring) by doing
ret = ret.encode("cp1251", errors="replace") # encodes to Win-1251
# or
ret = ret.encode("cp866", errors="replace") # encodes to windows/dos console
Only then you can save it to an ASCII file.
str(browser.webframe.toHtml()) saved me

Using utf-8 characters in a Jinja2 template

I'm trying to use utf-8 characters when rendering a template with Jinja2. Here is how my template looks like:
<!DOCTYPE HTML>
<html manifest="" lang="en-US">
<head>
<meta charset="UTF-8">
<title>{{title}}</title>
...
The title variable is set something like this:
index_variables = {'title':''}
index_variables['title'] = myvar.encode("utf8")
template = env.get_template('index.html')
index_file = open(preview_root + "/" + "index.html", "w")
index_file.write(
template.render(index_variables)
)
index_file.close()
Now, the problem is that myvar is a message read from a message queue and can contain those special utf8 characters (ex. "Séptimo Cine").
The rendered template looks something like:
...
<title>S\u00e9ptimo Cine</title>
...
and I want it to be:
...
<title>Séptimo Cine</title>
...
I have made several tests but I can't get this to work.
I have tried to set the title variable without .encode("utf8"), but it throws an exception (ValueError: Expected a bytes object, not a unicode object), so my guess is that the initial message is unicode
I have used chardet.detect to get the encoding of the message (it's "ascii"), then did the following: myvar.decode("ascii").encode("cp852"), but the title is still not rendered correctly.
I also made sure that my template is a UTF-8 file, but it didn't make a difference.
Any ideas on how to do this?
TL;DR:
Pass Unicode to template.render()
Encode the rendered unicode result to a bytestring before writing it to a file
This had me puzzled for a while. Because you do
index_file.write(
template.render(index_variables)
)
in one statement, that's basically just one line where Python is concerned, so the traceback you get is misleading: The exception I got when recreating your test case didn't happen in template.render(index_variables), but in index_file.write() instead. So splitting the code up like this
output = template.render(index_variables)
index_file.write(output)
was the first step to diagnose where exactly the UnicodeEncodeError happens.
Jinja returns unicode whet you let it render the template. Therefore you need to encode the result to a bytestring before you can write it to a file:
index_file.write(output.encode('utf-8'))
The second error is that you pass in an utf-8 encoded bytestring to template.render() - Jinja wants unicode. So assuming your myvar contains UTF-8, you need to decode it to unicode first:
index_variables['title'] = myvar.decode('utf-8')
So, to put it all together, this works for me:
# -*- coding: utf-8 -*-
from jinja2 import Environment, PackageLoader
env = Environment(loader=PackageLoader('myproject', 'templates'))
# Make sure we start with an utf-8 encoded bytestring
myvar = 'Séptimo Cine'
index_variables = {'title':''}
# Decode the UTF-8 string to get unicode
index_variables['title'] = myvar.decode('utf-8')
template = env.get_template('index.html')
with open("index_file.html", "wb") as index_file:
output = template.render(index_variables)
# jinja returns unicode - so `output` needs to be encoded to a bytestring
# before writing it to a file
index_file.write(output.encode('utf-8'))
Try changing your render command to this...
template.render(index_variables).encode( "utf-8" )
Jinja2's documentation says "This will return the rendered template as unicode string."
http://jinja.pocoo.org/docs/api/?highlight=render#jinja2.Template.render
Hope this helps!
Add the following lines to the beginning of your script and it will work fine without any further changes:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

Python flask flash message exception remains after restarting

I'm making a small flask app where I had something like this:
#app.route('/bye')
def logout():
session.pop('logged_in', None)
flash('Adiós')
return redirect('/index')
Needless to say when I ran the application and I navigated to '/bye' it gave me a UnicodeDecodeError. Well, now it gives me the same unicodedecodeerror on every page that extends the base template (which renders the messages) even after restarting the application. and always with the same dump() despite removing that flash in the source code. All I can think of is what the crap? Help please.
Well I had to restart my computer to clear the stupid session cache or something.
I think that flash() actually creates a session called session['_flashes']. See this code here. So you will probably have to either:
clear/delete the cookie
OR
session.pop('_flashes', None)
Flask flashing stores the messages in a session cookie until they are succesfully "consumed".
If you get a UnicodeDecodeError (https://wiki.python.org/moin/UnicodeDecodeError) in this case the messages is not consumed, so you get the error again and again.
My solution was to delete the cookie from the browser
Since I had the problem when using localization, I solved the cause now by installing my translation object like:
trans = gettext.GNUTranslations(...)
trans.install(unicode=True)
and having UTF-8 encoding in my python source files and "Content-Type: text/plain; charset=UTF-8\n" in the translation file (.pot)
You're using an non ascii string "adiós", so you need to ensure that python will process strings as unicode, not as ascii.
Add this to the header of your python file. This will tell the compiler that your file contains utf8 strings
#!/usr/bin/env python
# -*- coding: utf-8 -*-
so your code will be something like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from flask import Flask
app = Flask()
#app.route('/bye')
def logout():
session.pop('logged_in', None)
flash('Adiós')
return redirect('/index')

Unicode issue with Python scraper

I've been writing bad perl for a while, but am attempting to learn to write bad python instead. I've read around the problem I've been having for a couple of days now (and know an awful lot more about unicode as a result) but I'm still having troubles with a rogue em-dash in the following code:
import urllib2
def scrape(url):
# simplified
data = urllib2.urlopen(url)
return data.read()
def query_graph_api(url_list):
# query Facebook's Graph API, store data.
for url in url_list:
graph_query = graph_query_root + "%22" + url + "%22"
query_data = scrape(graph_query)
print query_data #debug console
### START HERE ####
graph_query_root = "https://graph.facebook.com/fql?q=SELECT%20normalized_url,share_count,like_count,comment_count,total_count%20FROM%20link_stat%20WHERE%20url="
url_list = ['http://www.supersavvyme.co.uk', 'http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more']
query_graph_api(url_list)
(This is a much simplified representation of the scraper, BTW. The original uses a site's sitemap.xml to build a list of URLs, then queries Facebook's Graph API for information on each -- here's the original scraper)
My attempts to debug this have consisted mostly of trying to emulate the infinite monkeys who are rewriting Shakespeare. My usual method (search StackOverflow for the error message, copy-and-paste the solution) has failed.
Question: how do I encode my data so that extended characters like the em-dash in the second URL won't break my code, but will still work in the FQL query?
P.S. I'm even wondering whether I'm asking the right question: might urllib.urlencode help me out here (certainly it would make that graph_query_root easier and prettier to create...
---8<----
The traceback I get from the actual scraper on ScraperWiki is as follows:
http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more
Line 80 - query_graph_api(urls)
Line 53 - query_data = scrape(graph_query) -- query_graph_api((urls=['http://www.supersavvyme.co.uk', 'http://...more
Line 21 - data = urllib2.urlopen(unicode(url)) -- scrape((url=u'https://graph.facebook.com/fql?q=SELECT%20url,...more
/usr/lib/python2.7/urllib2.py:126 -- urlopen((url=u'https://graph.facebook.com/fql?q=SELECT%20url,no...more
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 177: ordinal not in range(128)
If you are using Python 3.x, all you have to do is add one line and change another:
gq = graph_query.encode('utf-8')
query_data = scrape(gq)
If you are using Python 2.x, first put the following line in at the top of the module file:
# -*- coding: utf-8 -*- (read what this is for here)
and then make all your string literals unicode and encode just before passing to urlopen:
def scrape(url):
# simplified
data = urllib2.urlopen(url)
return data.read()
def query_graph_api(url_list):
# query Facebook's Graph API, store data.
for url in url_list:
graph_query = graph_query_root + u"%22" + url + u"%22"
gq = graph_query.encode('utf-8')
query_data = scrape(gq)
print query_data #debug console
### START HERE ####
graph_query_root = u"https://graph.facebook.com/fql?q=SELECT%20normalized_url,share_count,like_count,comment_count,total_count%20FROM%20link_stat%20WHERE%20url="
url_list = [u'http://www.supersavvyme.co.uk', u'http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more']
query_graph_api(url_list)
It looks from the code like you are using 3.x, which is really better for dealing with stuff like this. But you still have to encode when necessary. In 2.x, the best advice is to do what 3.x does by default: use unicode throughout your code, and only encode when bytes are called for.

detect and change website encoding in python

I have a problem with website encoding. I maked a program to scrape a website but i didn't have successfully with changing encoding of readed content. My code is:
import sys,os,glob,re,datetime,optparse
import urllib2
from BSXPath import BSXPathEvaluator,XPathResult
#import BeautifulSoup
#from utility import *
sTargetEncoding = "utf-8"
page_to_process = "http://www.xxxx.com"
req = urllib2.urlopen(page_to_process)
content = req.read()
encoding=req.headers['content-type'].split('charset=')[-1]
print encoding
ucontent = unicode(content, encoding).encode(sTargetEncoding)
#ucontent = content.decode(encoding).encode(sTargetEncoding)
#ucontent = content
document = BSXPathEvaluator(ucontent)
print "ORIGINAL ENCODING: " + document.originalEncoding
I used external library (BSXPath an extension of BeautifulSoap) and the document.originalEncoding print the encoding of website and not the utf-8 encoding that I tried to change.
Have anyone some suggestion?
Thanks
Well, there is no guarantee that the encoding presented by the HTTP headers is the same the some specified inside the HTML itself. This can happen either due to misconfiguration on the server side or the charset definition inside the HTML can be just wrong. There is really no automatic way to detect the encoding or to detect the right encoding. I suggest to check HTML manually for the right encoding (e.g. iso-8859-1 vs. utf-8 can be easily detected) and then hardcode the encoding somehow manually inside your app.

Categories

Resources