Unicode characters are boxes in Geraldo/ReportLab generated PDF - python

I'm running into some Unicode related issues when generating PDF reports using Geraldo and ReportLab.
When Unicode strings containing Asian characters are passed into the report, they appear in the output PDF as black boxes. This example (http://dl.dropbox.com/u/2627296/report.pdf) was generated using the following code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from geraldo import Report, ReportBand, ObjectValue
from geraldo.generators import PDFGenerator

class UnicodeReport(Report):
    title = 'Report'

    class band_detail(ReportBand):
        elements = [ObjectValue(attribute_name='name')]

if __name__ == '__main__':
    objects = [{'name': u'한국어/조선말'}, {'name': u'汉语/漢語'}, {'name': u'オナカップ'}]
    rpt = UnicodeReport(queryset=objects)
    rpt.generate_by(PDFGenerator, filename='/tmp/report.pdf')
I'm using Python 2.7.1, Geraldo 0.4.14 and ReportLab 2.5 on Ubuntu 11.04 64-bit. The .py file is UTF-8 encoded. The black boxes are visible when the PDF is viewed in Document Viewer 2.32.0, Okular 0.12.2 and Adobe Reader 9.
Any help is greatly appreciated, thanks.

You should specify a font that actually contains the required CJK glyphs, as in the official "Additional Fonts" example. Use additional_fonts and default_style:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from geraldo import Report, ReportBand, ObjectValue
from geraldo.generators import PDFGenerator

class UnicodeReport(Report):
    title = 'Report'
    additional_fonts = {
        'wqy': '/usr/share/fonts/wqy-zenhei/wqy-zenhei.ttc'
    }
    default_style = {'fontName': 'wqy'}

    class band_detail(ReportBand):
        elements = [ObjectValue(attribute_name='name')]

if __name__ == '__main__':
    objects = [{'name': u'한국어/조선말'}, {'name': u'汉语/漢語'}, {'name': u'オナカップ'}]
    rpt = UnicodeReport(queryset=objects)
    rpt.generate_by(PDFGenerator, filename='/tmp/report.pdf')
ObjectValue() also accepts a style keyword argument, so the font can be set per element:
elements = [ObjectValue(attribute_name='name', style={'fontName': 'wqy'})]
The WenQuanYi Zen Hei font is open source and can be downloaded here: http://sourceforge.net/projects/wqy/files/ (I think it ships with Ubuntu 11.04).
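If you want to rule out the font file itself, a minimal sketch using plain ReportLab (no Geraldo) can confirm that a registered TrueType font renders these glyphs. The path and font name below are just the ones from the example above and are assumptions; point them at any TTF/TTC on your system that contains the characters.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Sanity check: register the font directly with ReportLab and draw the strings.
# If this PDF shows the characters, the black boxes come from the default PDF
# fonts (which have no CJK glyphs), not from your data.
from reportlab.pdfgen import canvas
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

# Path is an assumption; if your ReportLab build rejects the .ttc, use a .ttf instead.
pdfmetrics.registerFont(TTFont('wqy', '/usr/share/fonts/wqy-zenhei/wqy-zenhei.ttc'))

c = canvas.Canvas('/tmp/font_check.pdf')
c.setFont('wqy', 14)
c.drawString(72, 720, u'한국어/조선말  汉语/漢語  オナカップ')
c.save()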

Related

Special chars in CGI Python: \xe4 not ä

A Python CGI script receives a POST XMLHttpRequest with content type application/x-www-form-urlencoded. The values are not URL-encoded.
The request looks like this:
xhr.open("POST", "URL", true);
xhr.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
xhr.send('RID=123&RName=Bäcker');
In Chrome's Developer Tools, the form data looks like this in the source view:
RID=123&RName=Bäcker
or like this in parsed view:
RID: 123
RName: Bäcker
The Python 3 script reads the form from a FieldStorage:
#! /usr/bin/env python3
# -*- coding: UTF-8 -*-
import cgi
myform = cgi.FieldStorage()
print(str(myform["RName"].value))
The print output is B\xe4cker.
I have tried .encode('iso-8859-1') and .decode('utf-8'), but without success.
How can I change the encoding so that the value is displayed correctly?
Try passing the encoding to the str() call:
#! /usr/bin/env python3
# -*- coding: UTF-8 -*-
import cgi
myform = cgi.FieldStorage()
print(str(myform["RName"].value, encoding = 'iso-8859-1'))
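Alternatively, you can tell FieldStorage up front which charset to use when it decodes the request body. A minimal sketch, assuming the client really sends ISO-8859-1 (use 'utf-8' if it sends UTF-8):

#! /usr/bin/env python3
# -*- coding: UTF-8 -*-
import cgi

# Pass the charset the client uses, so no re-decoding is needed afterwards.
myform = cgi.FieldStorage(encoding='iso-8859-1')
print(myform["RName"].value)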

BeautifulSoup: scraping spanish characters issue

I'm trying to get some Spanish text from a website using BeautifulSoup and urllib2. I currently get this: ¡Hola! ¿Cómo estás?.
I have tried applying the different unicode functions I have seen on related threads, but nothing seems to work for my issue:
# import the main window object (mw) from aqt
from aqt import mw
# import the "show info" tool from utils.py
from aqt.utils import showInfo
# import all of the Qt GUI library
from aqt.qt import *
from BeautifulSoup import BeautifulSoup
import urllib2
wiki = "http://spanishdict.com/translate/hola"
page = urllib2.urlopen(wiki)
soup = BeautifulSoup(page)
dictionarydiv = soup.find("div", { "class" : "dictionary-neodict-example" })
dictionaryspans = dictionarydiv.contents
firstspan = dictionaryspans[0]
firstspantext = firstspan.contents
thetext = firstspantext[0]
thetextstring = str(thetext)
thetext is of type <class 'BeautifulSoup.NavigableString'>, which is a Unicode string; printing it encodes it with the terminal's output encoding:
print thetext
Output (in a Windows console):
¡Hola! ¿Cómo estás?
This will work on any terminal configured for an encoding supporting the Unicode characters being printed.
You'll get UnicodeEncodeError if your terminal is configured with an encoding that doesn't support the Unicode characters you try to print.
Using str on that type returns a byte string, in this case encoded in UTF-8. If you print that on anything but a UTF-8-configured terminal, you'll get an incorrect display.
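A minimal sketch of that distinction (Python 2, with a plain unicode literal standing in for the NavigableString from the question):

# -*- coding: utf-8 -*-
thetext = u'¡Hola! ¿Cómo estás?'       # what BeautifulSoup hands you: unicode

print thetext                          # encoded with the terminal's encoding on output
utf8_bytes = thetext.encode('utf-8')   # explicit bytes; only displays correctly on a UTF-8 terminal
print repr(utf8_bytes)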

hi § symbol unrecognized

Good morning. I'm trying to do the following, but it won't let me. Can you help me? Thank you very much.
soup = BeautifulSoup(html_page)
titulo=soup.find('h3').get_text()
titulo=titulo.replace('§','')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Declare the source encoding and operate on unicode strings:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html_page = u"<h3>§ title here</h3>"
soup = BeautifulSoup(html_page, "html.parser")
titulo = soup.find('h3').get_text()
titulo = titulo.replace(u'§', '')
print(titulo)
Prints title here.
Let me explain what the problem is:
By default, Python 2 does not accept non-ASCII characters such as "à" or "ò" in source code. To make it accept them, put this at the top of your script:
# -*- coding: utf-8 -*-
This declaration tells Python which encoding the source file uses, so string literals containing such characters are read correctly.
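A minimal sketch (Python 2) of why the error appears and how the unicode literal fixes it:

# -*- coding: utf-8 -*-
titulo = u'§ title here'

# Mixing unicode with a UTF-8 byte-string literal forces an implicit ASCII
# decode of b'\xc2\xa7', which raises the UnicodeDecodeError from the question.
try:
    titulo.replace('§', '')
except UnicodeDecodeError as exc:
    print(exc)

print(titulo.replace(u'§', ''))      # unicode literal: works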
Another way to set the coding (Python 2 only, and generally discouraged) is via the sys module:
import sys                       # sys.setdefaultencoding() has already been removed by site.py here
reload(sys)                      # reloading sys restores it
sys.setdefaultencoding('UTF8')   # choose the default encoding

How to use Django mail_managers function to print utf8?

I want to create a contact app that lets users send feedback, which I print to the terminal. I use mail_managers for this, but I can't solve an encoding problem.
body = u"信息来自:%s\n\n\t%s" % (email,text)
mail_managers(full_reason, body)
I want the terminal to print this:
信息来自:youremail#domain.com
成功
But the terminal actually prints this:
淇℃伅鏉ヨ嚜:youremail#domain.com
鎴愬姛
Try adding this to the top of your Python file:
# -*- coding: utf-8 -*-
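Putting it together, a minimal sketch of the view code, assuming MANAGERS and an email backend (for example the console backend) are configured in settings.py; full_reason is the subject variable from the question, the other names are made up for the example:

# -*- coding: utf-8 -*-
from django.core.mail import mail_managers

def send_feedback(full_reason, email, text):
    # Keep the body a unicode string; the coding declaration above lets the
    # source file contain the Chinese characters in the first place.
    body = u"信息来自:%s\n\n\t%s" % (email, text)
    mail_managers(full_reason, body)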

Regex on unicode string

I am trying to download a few hundred Korean pages like this one:
http://homeplusexpress.com/store/store_view.asp?cd_express=3
For each page, I want to use a regex to extract the "address" field, which in the above page looks like:
*주소 : 서울시 광진구 구의1동 236-53
So I do this:
>>> import requests
>>> resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
>>> resp.encoding
'ISO-8859-1'
>>> # I wonder why it's ISO-8859-1, since I thought that is for Latin text (Latin-1).
>>> html = resp.text
>>> type(html)
<type 'unicode'>
>>> html
(outputs a long string that contains a lot of characters like \xc3\xb7\xaf\xbd\xba \xc0\xcd\xbd\xba\xc7\xc1\xb7\xb9)
I then wrote a script. I set # -*- coding: utf-8 -*- on the .py file and put this:
address = re.search('주소', html)
However, re.search is returning None. I tried with and without the u prefix on the regex string.
Usually I can solve issues like this with a call to .encode or .decode but I tried a few things and am stuck. Any pointers on what I'm missing?
According to the meta tag in the HTML document header:
<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
the web page uses the euc-kr encoding.
I wrote this code:
# -*- coding: euc-kr -*-
import re
import requests
resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
html = resp.text
address = re.search('주소', html)
print address
Then I saved it in gedit using the euc-kr encoding.
I got a match.
But actually there is an even better solution! You can keep the utf-8 encoding for your files.
# -*- coding: utf-8 -*-
import re
import requests
resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
resp.encoding = 'euc-kr'
# we need to specify what the encoding is because the
# requests library couldn't detect it correctly
html = resp.text
# now the html variable is a properly decoded unicode instance
print type(html)
# we use the re.search functions with unicode strings
address = re.search(u'주소', html)
print address
From the requests documentation: "When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers."
If you check the website, you can see that no encoding is specified in the server response headers. I think the only option in this case is to specify the encoding directly:
# -*- coding: utf-8 -*-
import requests
import re
r = requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
r.encoding = 'euc-kr'
print re.search(ur'주소', r.text, re.UNICODE)
