I'm trying to download this page, https://itunes.apple.com/ru/app/farm-story/id367107953?mt=8 (it looks like this for me in Russia: http://screencloud.net/v/6a7o), via spynner in Python. The page uses some JavaScript checks, so one cannot simply download it without full browser emulation.
My code:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

from StringIO import StringIO
import spynner

def log(str, filename_end):
    filename = '/tmp/apple_log_%s.html' % filename_end
    print 'logged to %s' % filename
    f = open(filename, 'w')
    f.write(str)
    f.close()

debug_stream = StringIO()
browser = spynner.Browser(debug_level=3, debug_stream=debug_stream)
browser.load("https://itunes.apple.com/ru/app/farm-story/id367107953?mt=8")
ret = browser.contents
log(ret, 'noenc')
print 'content length = %s' % len(ret)
browser.close()
del browser

f = open('/tmp/apple_log_debug', 'w')
f.write(debug_stream.getvalue())
f.close()
print 'log stored in /tmp/apple_log_debug'
So, the problem is: either Apple or spynner handles Cyrillic symbols incorrectly. I see them fine if I call browser.show() after loading, but in the contents and in the logs they come out wrongly encoded, like <meta content="ÐолÑÑиÑÑ Farm Story⢠в App Store. ÐÑоÑмоÑÑеÑÑ ÑкÑинÑоÑÑ Ð¸ ÑейÑинги, пÑоÑиÑаÑÑ Ð¾ÑзÑÐ²Ñ Ð¿Ð¾ÐºÑпаÑелей." property="og:description">.
http://2cyr.com/ says that it is UTF-8 text displayed as if it were ISO-8859-1...
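If the bytes really are UTF-8 that got decoded as ISO-8859-1, the damage is reversible; a minimal sketch of the round trip (assuming ret holds the mojibake as a Unicode string):

fixed = ret.encode('iso-8859-1').decode('utf-8')  # undo the wrong decode, redo it as UTF-8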
As you can see, I don't use any headers in my request, but if I take them from Chrome's network debug console and pass them to the load() method, e.g. headers=[('Accept-Encoding', 'utf-8'), ('Accept-Language', 'ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4')], I get the same result.
Also, the same network console shows that Chrome sends gzip,deflate,sdch as its Accept-Encoding. I can try that too, but then I fail to decode what I get: <html><head></head><body>��}ksÇ�g!���4�I/z�O���/)�(yw���é®i��{�<v���:��ٷ�س-?�b�b�� j�... even if I strip the tags from the beginning and end of the result.
Any help?
Basically, browser.webframe.toHtml() returns a QString, in which case str() won't help if the result actually contains non-Latin Unicode characters.
If you want to get a Python unicode string you need to do:
ret = unicode(browser.webframe.toHtml().toUtf8(), encoding="UTF-8")
# if you want to get rid of non-Latin text
ret = ret.encode("ascii", errors="replace")  # encodes to a bytestring
In case you suspect it's in Russian, you could encode it to a Russian multibyte OEM string (still a bytestring) by doing:
ret = ret.encode("cp1251", errors="replace") # encodes to Win-1251
# or
ret = ret.encode("cp866", errors="replace") # encodes to windows/dos console
Only then can you save it to a plain text file.
str(browser.webframe.toHtml()) saved me
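For reference, a sketch of how the fix could slot into the original script (assuming the same log() helper from the question):

browser.load("https://itunes.apple.com/ru/app/farm-story/id367107953?mt=8")
ret = unicode(browser.webframe.toHtml().toUtf8(), encoding="UTF-8")  # proper Unicode, not QString
log(ret.encode('utf-8'), 'fixed')  # log() writes bytes, so encode explicitly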
Related
I'm trying to get some Spanish text from a website using BeautifulSoup and urllib2. I currently get this: ¡Hola! ¿Cómo estás?.
I have tried applying the different unicode functions I have seen on related threads, but nothing seems to work for my issue:
# import the main window object (mw) from aqt
from aqt import mw
# import the "show info" tool from utils.py
from aqt.utils import showInfo
# import all of the Qt GUI library
from aqt.qt import *
from BeautifulSoup import BeautifulSoup
import urllib2
wiki = "http://spanishdict.com/translate/hola"
page = urllib2.urlopen(wiki)
soup = BeautifulSoup(page)
dictionarydiv = soup.find("div", { "class" : "dictionary-neodict-example" })
dictionaryspans = dictionarydiv.contents
firstspan = dictionaryspans[0]
firstspantext = firstspan.contents
thetext = firstspantext[0]
thetextstring = str(thetext)
thetext is of type <class 'BeautifulSoup.NavigableString'>, which is a Unicode string; printing it encodes it to the output terminal's encoding:
print thetext
Output (in a Windows console):
¡Hola! ¿Cómo estás?
This will work on any terminal configured for an encoding supporting the Unicode characters being printed.
You'll get UnicodeEncodeError if your terminal is configured with an encoding that doesn't support the Unicode characters you try to print.
Using str on that type returns a byte string, in this case encoded in UTF-8. If you print that on anything but a UTF-8-configured terminal, you'll get an incorrect display.
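A minimal sketch of the safer pattern: keep the text as Unicode internally and encode explicitly only at the output boundary (the 'utf-8' target is an assumption; use whatever your consumer expects):

thetext = firstspantext[0]               # NavigableString, i.e. Unicode
thetextstring = thetext.encode('utf-8')  # explicit byte string instead of str(thetext)
print thetext                            # let Python encode for the terminal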
I've been searching a lot for this without any luck, so maybe I'm missing some concept or don't understand what I really need. Here is the problem:
I'm using pisa to create a pdf and this is the code I use for it:
def write_to_pdf(template_data, context_dict, filename):
    template = Template(template_data)
    context = Context(context_dict)
    html = template.render(context)
    result = StringIO.StringIO()
    pdf = pisa.pisaDocument(StringIO.StringIO(html.encode("UTF-8")), result, link_callback=fetch_resources)
    if not pdf.err:
        response = http.HttpResponse(mimetype='application/pdf')
        response['Content-Disposition'] = 'attachment; filename=%s.pdf' % filename
        response.write(result.getvalue())
        return response
    return http.HttpResponse('Problem creating PDF: %s' % cgi.escape(html))
So if I try to make this string become a pdf:
template_data = 'tésting á'
It turns into something like this (consider # a black spot in place of the letter):
t##sting á
I tried to use cgi.escape, without any luck: the black spot is still there, and it ends up printing the HTML tags as text. This is Python 2.7, so I can't use html.escape to solve all my problems.
So I need something that converts plain text to HTML entities without affecting the HTML tags already there. Any clues?
Oh and if I change that line:
pdf = pisa.pisaDocument(StringIO.StringIO(html.encode("UTF-8")), result, link_callback=fetch_resources)
to
pdf = pisa.pisaDocument(html, result, link_callback=fetch_resources)
it works, but it doesn't create the HTML entities, which I need because I don't know exactly what kind of characters will end up there, and they might not be supported by pisa.
Encode named HTML entities with Python
http://beckism.com/2009/03/named_entities_python/
There is also a Django app for both decoding and encoding:
https://github.com/cobrateam/python-htmlentities
For Python 2.x (change to html.entities.codepoint2name in Python 3.x):
'''
Registers a special handler for named HTML entities

Usage:
import named_entities
text = u'Some string with Unicode characters'
text = text.encode('ascii', 'named_entities')
'''
import codecs
from htmlentitydefs import codepoint2name

def named_entities(text):
    # codecs error handlers receive the UnicodeEncodeError instance
    if isinstance(text, (UnicodeEncodeError, UnicodeTranslateError)):
        s = []
        for c in text.object[text.start:text.end]:
            if ord(c) in codepoint2name:
                s.append(u'&%s;' % codepoint2name[ord(c)])
            else:
                s.append(u'&#%s;' % ord(c))
        return ''.join(s), text.end
    else:
        raise TypeError("Can't handle %s" % type(text).__name__)

codecs.register_error('named_entities', named_entities)
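With the handler registered, the pisa call from the question above could feed pure-ASCII HTML with named entities instead of raw UTF-8; a sketch (ASCII passes through unchanged, so the existing tags are unaffected):

html = template.render(context)                   # Unicode output from Django
escaped = html.encode('ascii', 'named_entities')  # u'tésting á' -> 't&eacute;sting &aacute;'
pdf = pisa.pisaDocument(StringIO.StringIO(escaped), result, link_callback=fetch_resources)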
I'm trying to make a program that:
reads a list of Chinese characters from a file, makes a dictionary from them (associating a sign with its meaning).
picks a random character and sends it to the browser using the BaseHTTPServer module when it gets a GET request.
Once I managed to read and store the signs properly (I tried writing them to another file to check that I got them right, and it worked), I couldn't figure out how to send them to my browser.
I connect to 127.0.0.1:4321 and the best I've managed is to get a (supposedly) url-encoded Chinese character, with its translation.
Code:
# -*- coding: utf-8 -*-
import codecs
from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler
from SocketServer import ThreadingMixIn
import threading
import random
import urllib

source = codecs.open('./signs_db.txt', 'rb', encoding='utf-16')

# Checking that utf-16 works fine with Chinese characters and the like:
#out = codecs.open('./test.txt', 'wb', encoding='utf-16')
#for line in source:
#    out.write(line)

db = {}
next(source)  # skip the header line
for line in source:
    if not line.isspace():
        tmp = line.split('\t')
        db[tmp[0]] = tmp[1].strip()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        message = threading.currentThread().getName()
        rKey = random.choice(db.keys())
        self.wfile.write(urllib.quote(rKey.encode("utf-8")) + ' : ' + db[rKey])
        self.wfile.write('\n')
        return

class ThreadedHTTPServer(ThreadingMixIn, HTTPServer):
    """Handle requests in a separate thread."""

if __name__ == '__main__':
    server = ThreadedHTTPServer(('localhost', 4321), Handler)
    print 'Starting server, use <Ctrl-C> to stop'
    server.serve_forever()
If I don't URL-encode the Chinese character, I get an error from Python:
self.wfile.write(rKey + ' : ' + db[rKey])
Which gives me this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u4e09' in position 0: ordinal not in range(128)
I've also tried encoding/decoding with 'utf-16', and I still get that kind of error message.
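The cause: self.wfile.write() expects a byte string, and handing it a Unicode string makes Python 2 encode it with the default ASCII codec, which fails on u'\u4e09'. Encoding explicitly before writing avoids that; a minimal sketch:

self.wfile.write((rKey + u' : ' + db[rKey] + u'\n').encode('utf-8'))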
Here is my test file:
Sign Translation
一 One
二 Two
三 Three
四 Four
五 Five
六 Six
七 Seven
八 Eight
九 Nine
十 Ten
So, my question is: "How can I get the Chinese characters coming from my script to display properly in my browser"?
Declare the encoding of your page by writing a meta tag and make sure to encode the entire Unicode string in UTF-8:
self.wfile.write(u'''\
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=UTF-8">
</head>
<body>
{} : {}
</body>
</html>'''.format(rKey, db[rKey]).encode('utf8'))
And/or declare the HTTP content type:
self.send_response(200)
self.send_header('Content-Type','text/html; charset=utf-8')
self.end_headers()
I have the following code using urllib and BeautifulSoup:
getSite = urllib.urlopen(pageName) # open current site
getSitesoup = BeautifulSoup(getSite.read()) # reading the site content
print getSitesoup.originalEncoding
for value in getSitesoup.find_all('link'): # extract all <link> tags
defLinks.append(value.get('href'))
The result:
/usr/lib/python2.6/site-packages/bs4/dammit.py:231: UnicodeWarning: Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
"Some characters could not be decoded, and were "
And when I try to read the site I get:
�7�e����0*"I߷�G�H����F������9-������;��E�YÞBs���������㔶?�4i���)�����^W�����`w�Ke��%��*9�.'OQB���V��#�����]���(P��^��q�$�S5���tT*�Z
The page is in UTF-8, but the server is sending it to you in a compressed format:
>>> print getSite.headers['content-encoding']
gzip
You'll need to decompress the data before running it through Beautiful Soup. I got an error using zlib.decompress() on the data, because plain zlib.decompress() does not expect the gzip header; writing the data to a file and using gzip.open() to read it back worked fine.
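For reference, both decompression variants can be done in memory with the standard library; a sketch (the wbits value of 16 + zlib.MAX_WBITS tells zlib to expect a gzip header):

import zlib
import gzip
from StringIO import StringIO

raw = getSite.read()
html = zlib.decompress(raw, 16 + zlib.MAX_WBITS)    # variant 1: zlib with gzip-header support
html = gzip.GzipFile(fileobj=StringIO(raw)).read()  # variant 2: gzip over an in-memory buffer
getSitesoup = BeautifulSoup(html)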
BeautifulSoup works with Unicode internally; it'll try to decode non-Unicode responses as UTF-8 by default.
It looks like the site you are trying to load uses a different encoding; for example, it could be UTF-16 instead:
>>> print u"""�7�e����0*"I߷�G�H����F������9-������;��E�YÞBs���������㔶?�4i���)�����^W�����`w�Ke��%��*9�.'OQB���V��#�����]���(P��^��q�$�S5���tT*�Z""".encode('utf-8').decode('utf-16-le')
뿯㞽뿯施뿯붿뿯붿⨰䤢럟뿯䞽뿯䢽뿯붿뿯붿붿뿯붿뿯붿뿯㦽붿뿯붿뿯붿뿯㮽뿯붿붿썙䊞붿뿯붿뿯붿뿯붿뿯붿铣㾶뿯㒽붿뿯붿붿뿯붿뿯붿坞뿯붿뿯붿뿯悽붿敋뿯붿붿뿯⪽붿✮兏붿뿯붿붿뿯䂽뿯붿뿯붿뿯嶽뿯붿뿯⢽붿뿯庽뿯붿붿붿㕓뿯붿뿯璽⩔뿯媽
It could be mac_cyrillic too:
>>> print u"""�7�e����0*"I߷�G�H����F������9-������;��E�YÞBs���������㔶?�4i���)�����^W�����`w�Ke��%��*9�.'OQB���V��#�����]���(P��^��q�$�S5���tT*�Z""".encode('utf-8').decode('mac_cyrillic')
пњљ7пњљeпњљпњљпњљпњљ0*"IяЈпњљGпњљHпњљпњљпњљпњљFпњљпњљпњљпњљпњљпњљ9-пњљпњљпњљпњљпњљпњљ;пњљпњљEпњљY√ЮBsпњљпњљпњљпњљпњљпњљпњљпњљпњљгФґ?пњљ4iпњљпњљпњљ)пњљпњљпњљпњљпњљ^Wпњљпњљпњљпњљпњљ`wпњљKeпњљпњљ%пњљпњљ*9пњљ.'OQBпњљпњљпњљVпњљпњљ#пњљпњљпњљпњљпњљ]пњљпњљпњљ(Pпњљпњљ^пњљпњљqпњљ$пњљS5пњљпњљпњљtT*пњљZ
But I have way too little information about what kind of site you are trying to load nor can I read the output of either encoding. :-)
You'll need to decode the result of urlopen() before passing it to BeautifulSoup:
getSite = urllib.urlopen(pageName).read().decode('utf-16')
Generally, the website will announce the encoding it used in the headers, in the form of a Content-Type header (probably text/html; charset=utf-16 or similar).
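A sketch of reading that charset in Python 2 (urllib's .headers is a mimetools.Message, whose getparam() extracts Content-Type parameters; the 'utf-8' fallback is an assumption):

getSite = urllib.urlopen(pageName)
charset = getSite.headers.getparam('charset') or 'utf-8'  # fall back if the server sends none
getSitesoup = BeautifulSoup(getSite.read().decode(charset))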
I ran into the same problem, and as Leonard mentioned, it was due to a compressed format.
This link solved it for me; it says to add ('Accept-Encoding', 'gzip,deflate') to the request headers. For example:
opener = urllib2.build_opener()
opener.addheaders = [('Referer', referer),
                     ('User-Agent', uagent),
                     ('Accept-Encoding', 'gzip,deflate')]
usock = opener.open(url)
url = usock.geturl()
data = decode(usock)
usock.close()
return data
Where the decode() function is defined by:
import zlib
import gzip
import StringIO

def decode(page):
    encoding = page.info().get("Content-Encoding")
    if encoding in ('gzip', 'x-gzip', 'deflate'):
        content = page.read()
        if encoding == 'deflate':
            data = StringIO.StringIO(zlib.decompress(content))
        else:
            data = gzip.GzipFile('', 'rb', 9, StringIO.StringIO(content))
        page = data.read()
    return page
I have a problem with website encoding. I made a program to scrape a website, but I haven't been successful in changing the encoding of the content I read. My code is:
import sys, os, glob, re, datetime, optparse
import urllib2
from BSXPath import BSXPathEvaluator, XPathResult
#import BeautifulSoup
#from utility import *

sTargetEncoding = "utf-8"
page_to_process = "http://www.xxxx.com"

req = urllib2.urlopen(page_to_process)
content = req.read()
encoding = req.headers['content-type'].split('charset=')[-1]
print encoding

ucontent = unicode(content, encoding).encode(sTargetEncoding)
#ucontent = content.decode(encoding).encode(sTargetEncoding)
#ucontent = content

document = BSXPathEvaluator(ucontent)
print "ORIGINAL ENCODING: " + document.originalEncoding
I'm using an external library (BSXPath, an extension of BeautifulSoup), and document.originalEncoding prints the original encoding of the website, not the utf-8 encoding that I tried to convert to.
Does anyone have a suggestion?
Thanks
Well, there is no guarantee that the encoding announced in the HTTP headers is the same as the one specified inside the HTML itself. This can happen either through misconfiguration on the server side, or because the charset declaration inside the HTML is simply wrong. There is no fully reliable automatic way to detect the right encoding. I suggest checking the HTML manually for the right encoding (e.g. iso-8859-1 vs. utf-8 can be distinguished easily) and then hardcoding the encoding inside your app.
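If hardcoding is not an option, the third-party chardet library can guess the encoding from the raw bytes; a minimal sketch (the guess is heuristic, not guaranteed):

import chardet

guess = chardet.detect(content)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
ucontent = unicode(content, guess['encoding'] or sTargetEncoding)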