Reading Chinese characters in a file and sending them to a browser - python

I'm trying to make a program that:
reads a list of Chinese characters from a file and builds a dictionary from them (associating each sign with its meaning);
picks a random character and sends it to the browser using the BaseHTTPServer module when it gets a GET request.
Once I had managed to read and store the signs properly (I wrote them back out to another file to check that I had read them correctly, and it worked), I couldn't figure out how to send them to my browser.
I connect to 127.0.0.1:4321 and the best I've managed is to get a (supposedly) URL-encoded Chinese character with its translation.
Code:
# -*- coding: utf-8 -*-
import codecs
from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler
from SocketServer import ThreadingMixIn
import threading
import random
import urllib

source = codecs.open('./signs_db.txt', 'rb', encoding='utf-16')

# Checking that utf-16 works fine with Chinese characters:
#out = codecs.open('./test.txt', 'wb', encoding='utf-16')
#for line in source:
#    out.write(line)

db = {}
next(source)  # skip the "Sign / Translation" header line
for line in source:
    if not line.isspace():
        tmp = line.split('\t')
        db[tmp[0]] = tmp[1].strip()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        message = threading.currentThread().getName()  # (unused)
        rKey = random.choice(db.keys())
        self.wfile.write(urllib.quote(rKey.encode("utf-8")) + ' : ' + db[rKey])
        self.wfile.write('\n')
        return

class ThreadedHTTPServer(ThreadingMixIn, HTTPServer):
    """Handle requests in a separate thread."""

if __name__ == '__main__':
    server = ThreadedHTTPServer(('localhost', 4321), Handler)
    print 'Starting server, use <Ctrl-C> to stop'
    server.serve_forever()
If I don't URL-encode the Chinese character, I get an error from Python:
self.wfile.write(rKey + ' : ' + db[rKey])
Which gives me this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u4e09' in position 0: ordinal not in range(128)
I've also tried encoding/decoding with 'utf-16', and I still get that kind of error message.
Here is my test file (the columns are tab-separated, matching the line.split('\t') above):
Sign Translation
一 One
二 Two
三 Three
四 Four
五 Five
六 Six
七 Seven
八 Eight
九 Nine
十 Ten
So, my question is: "How can I get the Chinese characters coming from my script to display properly in my browser?"

Declare the encoding of your page by writing a meta tag and make sure to encode the entire Unicode string in UTF-8:
self.wfile.write(u'''\
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=UTF-8">
</head>
<body>
{} : {}
</body>
</html>'''.format(rKey, db[rKey]).encode('utf8'))
And/or declare the HTTP content type:
self.send_response(200)
self.send_header('Content-Type','text/html; charset=utf-8')
self.end_headers()
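Putting the two together, the handler from the question might end up like this (a minimal sketch; it reuses the db dictionary and imports from the script above):
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.end_headers()
        rKey = random.choice(db.keys())
        # Build the whole page as one unicode string, then encode once.
        page = u'<html><head></head><body>{} : {}</body></html>'.format(rKey, db[rKey])
        self.wfile.write(page.encode('utf-8'))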

Related

LZMA Returns Input Format not supported

It seems that when I try to decompress some bytes that were decoded from base64, lzma raises "Input Format not Supported". I cannot isolate the issue: when I pull the decoding logic alone into a new file, the error does not happen, which makes me think this has something to do with the way Flask passes arguments to the functions.
Code:
from flask import Flask
import base64
import lzma
from urllib.parse import quote, unquote

app = Flask('app')

@app.route('/')
def hello_world():
    return 'Hello, World!<br><button onclick = "var base = \'https://Text-Viewer-from-Bsace-64-URL.inyourface3445.repl.co/encode\';location.href = `${base}/${prompt(\'What do you want to send?\')}`" >Use</button>'

newline = '/n'

@app.route('/view/<path:b64>')
def viewer(b64):
    print(type(b64))
    s1 = base64.b64decode(b64.encode() + b'==')
    s2 = lzma.decompress(s1).decode()
    s3 = unquote(s2).replace(newline, '<br>')
    return f'<div style="overflow-x: auto;">{s3}</div>'

@app.route('/encode/<path:txt>')
def encode(txt):
    quote_text = quote(txt, safe="")
    compressed_text = lzma.compress(quote_text.encode())
    base_64_txt = base64.b64encode(compressed_text).decode()
    return f'text link '

app.run(host='0.0.0.0', port=8080, debug=True)
Can someone explain what I am doing wrong?
You are passing a base64-encoded string as a part of the URL, and that string may contain characters that get mangled in the process.
For example, visiting /encode/hello will give the following URL:
https://text-viewer-from-bsace-64-url.inyourface3445.repl.co/view//Td6WFoAAATm1rRGAgAhARYAAAB0L+WjAQAEaGVsbG8AAAAAsTe52+XaHpsAAR0FuC2Arx+2830BAAAAAARZWg==
Several characters could go wrong:
The first character is /, and as a result Flask will redirect from view//Td6... to view/Td6...: in other words, the first character gets deleted.
Depending on how URL-encoding is performed by the browser and URL-decoding is performed by Flask, the + character may be decoded into a space.
To avoid these issues, I would suggest using base64.urlsafe_b64encode / base64.urlsafe_b64decode which are versions of the base64 encoding where the output can be used in URLs without being mangled.
The following changes to your code seem to do the trick:
s1 = base64.urlsafe_b64decode(b64.encode()) in viewer
base_64_txt = base64.urlsafe_b64encode(compressed_text).decode() in encode
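A quick round-trip sketch of the urlsafe variant (Python 3, standard library only; the sample text is made up for illustration):
import base64
import lzma
from urllib.parse import quote

text = 'hello world'
compressed = lzma.compress(quote(text, safe='').encode())

# urlsafe_b64encode swaps '+' for '-' and '/' for '_', so the token can sit
# in a URL path without being re-interpreted or mangled along the way.
token = base64.urlsafe_b64encode(compressed).decode()

restored = lzma.decompress(base64.urlsafe_b64decode(token)).decode()
assert restored == quote(text, safe='')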

Twisted Web: problems encoding unicode

This is probably a stupid question/problem, but I could not find an answer for it. Also, it may not really be Twisted-specific.
I am trying to write a resource for a twisted.web webserver, which should serve a page containing non-ASCII characters.
According to this discussion, all I need to do is set the Content-Type HTTP header and return an encoded string.
Unfortunately, the page shows invalid characters.
Here is the code (as a .rpy):
"""a unicode test"""
from twisted.web.resource import Resource
class UnicodeTestResource(Resource):
"""A unicode test resource."""
isLeaf = True
encoding = "utf-8"
def render_GET(self, request):
text = u"unicode test\n ä ö ü ß"
raw = u"<HTML><HEAD><TITLE>Unicode encoding test</TITLE><HEAD><BODY><P>{t}</P></BODY></HTML>".format(t=text)
enc = raw.encode(self.encoding)
request.setHeader("Content-Type", "text/html; charset=" + self.encoding)
return enc
resource = UnicodeTestResource()
The result (without the HTML) is: unicode test ä ö ü Ã.
Is this caused by an encoding mismatch between the server and the client?
I am using Python 2.7.12 and Twisted 17.1.0. The page was accessed using Firefox.
Sorry for my terrible English. Thanks!
EDIT: I found the problem. I assumed that twisted.web.static.File with a ResourceScript processor would use the source encoding declared in the file in which the reactor is running.
Apparently this is not the case.
Adding # -*- coding: utf-8 -*- at the top of each .rpy file fixed the problem.
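For clarity, the top of the fixed .rpy then looks like this (per PEP 263 the declaration must appear on line 1 or 2, before the docstring):
# -*- coding: utf-8 -*-
"""a unicode test"""
from twisted.web.resource import Resource
# ... the rest of the resource exactly as above ...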

Spynner wrong encoding

I'm trying to download this page - https://itunes.apple.com/ru/app/farm-story/id367107953?mt=8 (looks like this for me in Russia - http://screencloud.net/v/6a7o) via spynner in Python. It uses some JavaScript checking, so one does not simply download it without full browser emulation.
My code:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from StringIO import StringIO
import spynner

def log(str, filename_end):
    filename = '/tmp/apple_log_%s.html' % filename_end
    print 'logged to %s' % filename
    f = open(filename, 'w')
    f.write(str)
    f.close()

debug_stream = StringIO()
browser = spynner.Browser(debug_level=3, debug_stream=debug_stream)
browser.load("https://itunes.apple.com/ru/app/farm-story/id367107953?mt=8")
ret = browser.contents
log(ret, 'noenc')
print 'content length = %s' % len(ret)
browser.close()
del browser

f = open('/tmp/apple_log_debug', 'w')
f.write(debug_stream.getvalue())
f.close()
print 'log stored in /tmp/apple_log_debug'
So, the problem is: either Apple or spynner handles Cyrillic symbols incorrectly. I see them fine if I call browser.show() after loading, but in the contents and logs they come out wrongly encoded, like <meta content="ÐолÑÑиÑÑ Farm Story⢠в App Store. ÐÑоÑмоÑÑеÑÑ ÑкÑинÑоÑÑ Ð¸ ÑейÑинги, пÑоÑиÑаÑÑ Ð¾ÑзÑÐ²Ñ Ð¿Ð¾ÐºÑпаÑелей." property="og:description">.
http://2cyr.com/ says that it is UTF-8 text displayed as ISO-8859-1...
As you see, I don't use any headers in my request, but if I take them from Chrome's network debug console and pass them to the load() method, e.g. headers=[('Accept-Encoding', 'utf-8'), ('Accept-Language', 'ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4')], I get the same result.
Also, from the same network console you can see that Chrome uses gzip,deflate,sdch as Accept-Encoding. I can try that too, but I fail to decode what I get: <html><head></head><body>��}ksÇ�g!���4�I/z�O���/)�(yw���é®i��{�<v���:��ٷ�س-?�b�b�� j�... even if I remove the tags from the beginning and end of the result.
Any help?
Basically, browser.webframe.toHtml() returns a QString, in which case str() won't help if the result actually contains non-Latin unicode characters.
If you want to get a Python unicode string you need to do:
ret = unicode(browser.webframe.toHtml().toUtf8(), encoding="UTF-8")
#if you want to get rid of non-latin text
ret = ret.encode("ascii", errors="replace") # encodes to bytestring
In case you suspect it's in Russian, you could encode it to a Russian multibyte OEM string (still a bytestring) by doing:
ret = ret.encode("cp1251", errors="replace") # encodes to Win-1251
# or
ret = ret.encode("cp866", errors="replace") # encodes to windows/dos console
Only then you can save it to an ASCII file.
str(browser.webframe.toHtml()) saved me
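For reference, a small end-to-end sketch (Python 2; assumes browser is the loaded spynner.Browser from the question, and the output path is made up) that gets a real unicode string and writes it to disk as UTF-8 instead of relying on sys.setdefaultencoding:
import codecs

# toHtml() returns a QString; toUtf8() gives its bytes, which we decode
# into a real Python unicode object.
html = unicode(browser.webframe.toHtml().toUtf8(), encoding="UTF-8")

# Write with an explicit encoding so no implicit ASCII conversion happens.
with codecs.open('/tmp/apple_log_fixed.html', 'w', encoding='utf-8') as f:
    f.write(html)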

Regex on unicode string

I am trying to download a few hundred Korean pages like this one:
http://homeplusexpress.com/store/store_view.asp?cd_express=3
For each page, I want to use a regex to extract the "address" field, which in the above page looks like:
*주소 : 서울시 광진구 구의1동 236-53
So I do this:
>>> import requests
>>> resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
>>> resp.encoding
'ISO-8859-1'
>>> # I wonder why it's ISO-8859-1, since I thought that is for Latin text (Latin-1).
>>> html = resp.text
>>> type(html)
<type 'unicode'>
>>> html
(outputs a long string that contains a lot of characters like \xc3\xb7\xaf\xbd\xba \xc0\xcd\xbd\xba\xc7\xc1\xb7\xb9\)
I then wrote a script. I set # -*- coding: utf-8 -*- at the top of the .py file and put this:
address = re.search('주소', html)
However, re.search is returning None. I tried with and without the u prefix on the regex string.
Usually I can solve issues like this with a call to .encode or .decode but I tried a few things and am stuck. Any pointers on what I'm missing?
According to the meta tag in the HTML document header:
<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
the web page uses the euc-kr encoding.
I wrote this code:
# -*- coding: euc-kr -*-
import re
import requests
resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
html = resp.text
address = re.search('주소', html)
print address
Then I saved it in gedit using the euc-kr encoding.
I got a match.
But actually there is an even better solution! You can keep the utf-8 encoding for your files.
# -*- coding: utf-8 -*-
import re
import requests
resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
resp.encoding = 'euc-kr'
# we need to specify what the encoding is because the
# requests library couldn't detect it correctly
html = resp.text
# now the html variable contains a utf-8 encoded unicode instance
print type(html)
# we use the re.search functions with unicode strings
address = re.search(u'주소', html)
print address
From the requests documentation: "When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers."
If you check this website, you can see there is no charset in the server's response headers.
I think the only option in this case is to specify the encoding to use directly:
# -*- coding: utf-8 -*-
import requests
import re
r = requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
r.encoding = 'euc-kr'
print re.search(ur'주소', r.text, re.UNICODE)
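If you would rather not hard-code euc-kr, requests also exposes apparent_encoding, a guess based on the response body (via chardet) rather than the HTTP headers. A sketch; whether it actually guesses EUC-KR for this particular server is not verified:
# -*- coding: utf-8 -*-
import re
import requests

r = requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
print r.encoding                   # 'ISO-8859-1', guessed from the headers alone
r.encoding = r.apparent_encoding   # guess from the body instead (chardet-based)
print re.search(u'주소', r.text)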

Unicode issue with Python scraper

I've been writing bad Perl for a while, but am attempting to learn to write bad Python instead. I've been reading around the problem for a couple of days now (and know an awful lot more about unicode as a result), but I'm still having trouble with a rogue em-dash in the following code:
import urllib2

def scrape(url):
    # simplified
    data = urllib2.urlopen(url)
    return data.read()

def query_graph_api(url_list):
    # query Facebook's Graph API, store data.
    for url in url_list:
        graph_query = graph_query_root + "%22" + url + "%22"
        query_data = scrape(graph_query)
        print query_data  # debug console

### START HERE ####
graph_query_root = "https://graph.facebook.com/fql?q=SELECT%20normalized_url,share_count,like_count,comment_count,total_count%20FROM%20link_stat%20WHERE%20url="
url_list = ['http://www.supersavvyme.co.uk', 'http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more']
query_graph_api(url_list)
(This is a much simplified representation of the scraper, BTW. The original uses a site's sitemap.xml to build a list of URLs, then queries Facebook's Graph API for information on each -- here's the original scraper)
My attempts to debug this have consisted mostly of trying to emulate the infinite monkeys who are rewriting Shakespeare. My usual method (search Stack Overflow for the error message, copy and paste the solution) has failed.
Question: how do I encode my data so that extended characters like the em-dash in the second URL won't break my code, but will still work in the FQL query?
P.S. I'm even wondering whether I'm asking the right question: might urllib.urlencode help me out here? (Certainly it would make that graph_query_root easier and prettier to create...)
---8<----
The traceback I get from the actual scraper on ScraperWiki is as follows:
http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more
Line 80 - query_graph_api(urls)
Line 53 - query_data = scrape(graph_query) -- query_graph_api((urls=['http://www.supersavvyme.co.uk', 'http://...more
Line 21 - data = urllib2.urlopen(unicode(url)) -- scrape((url=u'https://graph.facebook.com/fql?q=SELECT%20url,...more
/usr/lib/python2.7/urllib2.py:126 -- urlopen((url=u'https://graph.facebook.com/fql?q=SELECT%20url,no...more
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 177: ordinal not in range(128)
If you are using Python 3.x, all you have to do is add one line and change another:
gq = graph_query.encode('utf-8')
query_data = scrape(gq)
If you are using Python 2.x, first put the following line at the top of the module file:
# -*- coding: utf-8 -*-
and then make all your string literals unicode and encode just before passing to urlopen:
def scrape(url):
    # simplified
    data = urllib2.urlopen(url)
    return data.read()

def query_graph_api(url_list):
    # query Facebook's Graph API, store data.
    for url in url_list:
        graph_query = graph_query_root + u"%22" + url + u"%22"
        gq = graph_query.encode('utf-8')
        query_data = scrape(gq)
        print query_data  # debug console

### START HERE ####
graph_query_root = u"https://graph.facebook.com/fql?q=SELECT%20normalized_url,share_count,like_count,comment_count,total_count%20FROM%20link_stat%20WHERE%20url="
url_list = [u'http://www.supersavvyme.co.uk', u'http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more']
query_graph_api(url_list)
It looks from the code (urllib2, print statements) like you are using 2.x; 3.x is really better for dealing with stuff like this, but you still have to encode when necessary. In 2.x, the best advice is to do what 3.x does by default: use unicode throughout your code, and only encode when bytes are called for.
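On the P.S.: urllib.quote (rather than urlencode, which builds key=value pairs) is indeed the clean way to handle the em-dash. A sketch of that route (Python 2, reusing graph_query_root from the snippet above):
# -*- coding: utf-8 -*-
import urllib

url = u'http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more'

# Encode to UTF-8 bytes first, then percent-encode everything except the
# characters that keep the URL structure readable; the em-dash comes out
# as %E2%80%93, so the final query string is pure ASCII.
quoted = urllib.quote(url.encode('utf-8'), safe=':/')
graph_query = graph_query_root + u"%22" + quoted + u"%22"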
