A Python CGI script receives a POST XMLHttpRequest with content type application/x-www-form-urlencoded. The values are not encoded.
The request looks like this:
xhr.open("POST", "URL", true);
xhr.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
xhr.send('RID=123&RName=Bäcker');
In Chrome's Developer Tools, the Form Data looks like this in source view:
RID=123&RName=Bäcker
or like this in parsed view:
RID: 123
RName: Bäcker
The Python 3 script reads the form into a FieldStorage:
#! /usr/bin/env python3
# -*- coding: UTF-8 -*-
import cgi
myform = cgi.FieldStorage()
print(str(myform["RName"].value))
The printed output is B\xe4cker.
I have tried .encode('iso-8859-1') and .decode('utf-8'), but without success.
How can I change the encoding so that the value is displayed correctly?
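Try passing an encoding to str(). Note that this only works if .value is a bytes object; if it is already a str, Python 3 raises a TypeError: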
#! /usr/bin/env python3
# -*- coding: UTF-8 -*-
import cgi
myform = cgi.FieldStorage()
print(str(myform["RName"].value, encoding = 'iso-8859-1'))
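To find out which encoding the bytes are really in, it can help to decode a raw sample both ways. This is a minimal sketch, assuming the name arrived as ISO-8859-1; the sample bytes are constructed for illustration and not captured from the actual request (a browser may well send UTF-8 instead).

```python
# Sample bytes for "Bäcker" as ISO-8859-1 would encode them. This is an
# assumption about the request, made only for the demonstration.
raw = "Bäcker".encode("iso-8859-1")

# Decoding with the matching codec recovers the name.
as_latin1 = raw.decode("iso-8859-1")

# Decoding with the wrong codec cannot interpret the lone 0xE4 byte, so it
# is swapped for U+FFFD, the replacement character.
as_utf8 = raw.decode("utf-8", errors="replace")

# ascii() keeps the output readable even on a terminal that is not UTF-8.
print(repr(raw))
print(ascii(as_latin1))
print(ascii(as_utf8))
```

If the decoded text looks right with one codec and contains replacement characters with the other, the first codec is the likely candidate for the request encoding.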
Related
I want to create a contact app that lets users send feedback, which is printed to the terminal. I use mail_managers to do this, but I cannot solve an encoding problem.
body = u"信息来自:%s\n\n\t%s" % (email,text)
mail_managers(full_reason, body)
I want the terminal to print the following:
信息来自:youremail#domain.com
成功
The terminal actually prints this:
淇℃伅鏉ヨ嚜:youremail#domain.com
鎴愬姛
Try adding this to the top of your Python file:
# -*- coding: utf-8 -*-
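The garbled output in the question resembles what appears when UTF-8 bytes are interpreted as GBK. The following Python 3 sketch reproduces that kind of mismatch and reverses it; the assumption that GBK is the codec involved is mine and is not stated in the question.

```python
# The text the application intends to print.
text = "信息来自"

# Encode as UTF-8, then decode with a different codec (GBK is assumed here)
# to reproduce the kind of garbling a mismatched terminal would show.
utf8_bytes = text.encode("utf-8")
garbled = utf8_bytes.decode("gbk")

# Reversing both steps recovers the original text, which suggests that the
# data itself is intact and only the interpretation is wrong.
recovered = garbled.encode("gbk").decode("utf-8")

print(ascii(garbled))
print(recovered == text)
```

If the round trip succeeds on real output, the fix usually belongs where the bytes are displayed or decoded rather than where the string is built.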
I am trying to download a few hundred Korean pages like this one:
http://homeplusexpress.com/store/store_view.asp?cd_express=3
For each page, I want to use a regex to extract the "address" field, which in the above page looks like:
*주소 : 서울시 광진구 구의1동 236-53
So I do this:
>>> import requests
>>> resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
>>> resp.encoding
'ISO-8859-1'
>>> # I wonder why it's ISO-8859-1, since I thought that is for Latin text (Latin-1).
>>> html = resp.text
>>> type(html)
<type 'unicode'>
>>> html
(outputs a long string that contains a lot of characters like \xc3\xb7\xaf\xbd\xba \xc0\xcd\xbd\xba\xc7\xc1\xb7\xb9\)
I then wrote a script. I set # -*- coding: utf-8 -*- on the .py file and put this:
address = re.search('주소', html)
However, re.search is returning None. I tried with and without the u prefix on the regex string.
Usually I can solve issues like this with a call to .encode or .decode but I tried a few things and am stuck. Any pointers on what I'm missing?
According to the tag in the html document header:
<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
the web page uses the euc-kr encoding.
I wrote this code:
# -*- coding: euc-kr -*-
import re
import requests
resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
html = resp.text
address = re.search('주소', html)
print address
Then I saved it in gedit using the euc-kr encoding.
I got a match.
But actually there is an even better solution! You can keep the utf-8 encoding for your files.
# -*- coding: utf-8 -*-
import re
import requests
resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
resp.encoding = 'euc-kr'
# we need to specify what the encoding is because the
# requests library couldn't detect it correctly
html = resp.text
# now the html variable contains a unicode instance decoded from euc-kr
print type(html)
# we use the re.search functions with unicode strings
address = re.search(u'주소', html)
print address
From the Requests documentation: "When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers."
If you check the website, you can see that the server response does not specify an encoding.
I think the only option in this case is to specify the encoding directly:
# -*- coding: utf-8 -*-
import requests
import re
r = requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
r.encoding = 'euc-kr'
print re.search(ur'주소', r.text, re.UNICODE)
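The effect of the wrong codec can be reproduced without a network call. This Python 3 sketch builds the response body by hand from the address line quoted in the question (an assumption made for illustration) and decodes it both ways before searching.

```python
import re

# Stand-in for the raw response body: the address line from the question,
# encoded the way the page declares it (euc-kr).
body = "*주소 : 서울시 광진구 구의1동 236-53".encode("euc-kr")

# Decoding as ISO-8859-1 never fails, but yields Latin-1 characters only,
# so a Korean pattern has nothing to match against.
wrong = body.decode("iso-8859-1")

# Decoding with the codec the page declares restores the Hangul text.
right = body.decode("euc-kr")

match_wrong = re.search("주소", wrong)
match_right = re.search("주소", right)

print(match_wrong)
print(match_right is not None)
```

With Requests, setting the response's encoding attribute before reading its text, as in the answers above, has the same effect as the second decode here.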
I'm trying to use UTF-8 characters when rendering a template with Jinja2. Here is what my template looks like:
<!DOCTYPE HTML>
<html manifest="" lang="en-US">
<head>
<meta charset="UTF-8">
<title>{{title}}</title>
...
The title variable is set something like this:
index_variables = {'title':''}
index_variables['title'] = myvar.encode("utf8")
template = env.get_template('index.html')
index_file = open(preview_root + "/" + "index.html", "w")
index_file.write(
template.render(index_variables)
)
index_file.close()
Now, the problem is that myvar is a message read from a message queue and can contain those special utf8 characters (ex. "Séptimo Cine").
The rendered template looks something like:
...
<title>S\u00e9ptimo Cine</title>
...
and I want it to be:
...
<title>Séptimo Cine</title>
...
I have made several tests but I can't get this to work.
I have tried to set the title variable without .encode("utf8"), but it throws an exception (ValueError: Expected a bytes object, not a unicode object), so my guess is that the initial message is unicode.
I have used chardet.detect to get the encoding of the message (it's "ascii"), then did the following: myvar.decode("ascii").encode("cp852"), but the title is still not rendered correctly.
I also made sure that my template is a UTF-8 file, but it didn't make a difference.
Any ideas on how to do this?
TL;DR:
Pass Unicode to template.render()
Encode the rendered unicode result to a bytestring before writing it to a file
This had me puzzled for a while. Because you do
index_file.write(
template.render(index_variables)
)
in one statement, that's basically just one line where Python is concerned, so the traceback you get is misleading: The exception I got when recreating your test case didn't happen in template.render(index_variables), but in index_file.write() instead. So splitting the code up like this
output = template.render(index_variables)
index_file.write(output)
was the first step to diagnose where exactly the UnicodeEncodeError happens.
Jinja returns unicode when you let it render the template. Therefore you need to encode the result to a bytestring before you can write it to a file:
index_file.write(output.encode('utf-8'))
The second error is that you pass a UTF-8 encoded bytestring to template.render(), but Jinja wants unicode. So assuming your myvar contains UTF-8, you need to decode it to unicode first:
index_variables['title'] = myvar.decode('utf-8')
So, to put it all together, this works for me:
# -*- coding: utf-8 -*-
from jinja2 import Environment, PackageLoader
env = Environment(loader=PackageLoader('myproject', 'templates'))
# Make sure we start with an utf-8 encoded bytestring
myvar = 'Séptimo Cine'
index_variables = {'title':''}
# Decode the UTF-8 string to get unicode
index_variables['title'] = myvar.decode('utf-8')
template = env.get_template('index.html')
with open("index_file.html", "wb") as index_file:
    output = template.render(index_variables)
    # jinja returns unicode - so `output` needs to be encoded to a bytestring
    # before writing it to a file
    index_file.write(output.encode('utf-8'))
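The same decode, render, encode pattern carries over to Python 3. The sketch below is only an illustration: it uses string.Template from the standard library as a stand-in for Jinja2 so that it runs without extra packages, and the message bytes and output path are made up for the example.

```python
import os
import tempfile
from string import Template

# Bytes as they might arrive from a message queue (assumed to be UTF-8).
raw_message = "Séptimo Cine".encode("utf-8")

# 1. Decode at the input boundary so the template receives text, not bytes.
title = raw_message.decode("utf-8")

# 2. Render with text. string.Template is only a stand-in for Jinja2 here.
page = Template("<title>$title</title>").substitute(title=title)

# 3. Encode at the output boundary; open() does this when given an encoding.
out_path = os.path.join(tempfile.mkdtemp(), "index.html")
with open(out_path, "w", encoding="utf-8") as index_file:
    index_file.write(page)

# Read the file back as bytes to check what was actually written.
with open(out_path, "rb") as index_file:
    written = index_file.read()

print(written)
```

Keeping bytes at the edges and text in the middle means the template engine never has to guess an encoding.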
Try changing your render command to this...
template.render(index_variables).encode( "utf-8" )
Jinja2's documentation says "This will return the rendered template as unicode string."
http://jinja.pocoo.org/docs/api/?highlight=render#jinja2.Template.render
Hope this helps!
Add the following lines to the beginning of your script and it will work fine without any further changes:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
I'm running into some Unicode related issues when generating PDF reports using Geraldo and ReportLab.
When Unicode strings containing Asian characters are passed into the report, they appear in the output PDF as black boxes. This example (http://dl.dropbox.com/u/2627296/report.pdf) was generated using the following code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from geraldo import Report, ReportBand, ObjectValue
from geraldo.generators import PDFGenerator
class UnicodeReport(Report):
    title = 'Report'
    class band_detail(ReportBand):
        elements = [ObjectValue(attribute_name='name')]
if __name__ == '__main__':
    objects = [{'name': u'한국어/조선말'}, {'name': u'汉语/漢語'}, {'name': u'オナカップ'}]
    rpt = UnicodeReport(queryset=objects)
    rpt.generate_by(PDFGenerator, filename='/tmp/report.pdf')
I'm using Python 2.7.1, Geraldo 0.4.14 and ReportLab 2.5. The system is Ubuntu 11.04 64-bit. The .py file is also UTF-8 encoded. The black boxes are visible when the PDF is viewed in Document Viewer 2.32.0, Okular 0.12.2 and Adobe Reader 9.
Any help is greatly appreciated, thanks.
You should specify the font name as in the official example "Additional Fonts". Use additional_fonts and default_style:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from geraldo import Report, ReportBand, ObjectValue
from geraldo.generators import PDFGenerator
class UnicodeReport(Report):
    title = 'Report'
    additional_fonts = {
        'wqy': '/usr/share/fonts/wqy-zenhei/wqy-zenhei.ttc'
    }
    default_style = {'fontName': 'wqy'}
    class band_detail(ReportBand):
        elements = [ObjectValue(attribute_name='name')]
if __name__ == '__main__':
    objects = [{'name': u'한국어/조선말'}, {'name': u'汉语/漢語'}, {'name': u'オナカップ'}]
    rpt = UnicodeReport(queryset=objects)
    rpt.generate_by(PDFGenerator, filename='/tmp/report.pdf')
ObjectValue() also has a named parameter style:
elements = [ObjectValue(attribute_name='name', style={'fontName': 'wqy'})]
This font is open source and can be downloaded here: http://sourceforge.net/projects/wqy/files/ (I think it's shipped with Ubuntu 11.04)
"query" = джазовыми
For some reason, when I display it via:
{{ query|safe }}
I get this:
%u0434%u0436%u0430%u0437%u043E%u0432%u044B%u043C%u0438
If the query were set in the source code, this would solve it:
query = u"джазовыми"
(provided that, for example, your file encoding is UTF-8 and you have the corresponding line
# -*- coding: UTF-8 -*-
at the beginning)
But I guess the query is entered by the user. The error seems to be located in that part of your code. Can you show how it is done?
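The %uXXXX sequences in the output match the format produced by JavaScript's legacy escape() function, so one possibility is that the query is escaped in the browser before it is sent. If that is the case, a minimal Python 3 sketch for decoding it could look like this; unescape_js is a hypothetical helper name, not part of any library.

```python
import re
from urllib.parse import unquote

def unescape_js(text):
    """Decode %uXXXX and %XX sequences in the style of JavaScript's escape().

    Hypothetical helper for illustration. Characters outside the Basic
    Multilingual Plane arrive as two %uXXXX surrogate halves and are not
    recombined here.
    """
    # Turn every %uXXXX into the character with that code point.
    text = re.sub(
        r"%u([0-9A-Fa-f]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )
    # escape() writes characters below 256 as %XX using Latin-1 values.
    return unquote(text, encoding="latin-1")

query = unescape_js("%u0434%u0436%u0430%u0437%u043E%u0432%u044B%u043C%u0438")
print(ascii(query))
```

On the browser side, encodeURIComponent() produces standard UTF-8 percent-encoding, which most server frameworks decode without extra work.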