I'm trying to download some content from a dictionary site like http://dictionary.reference.com/browse/apple?s=t
The problem I'm having is that the original paragraph has all those squiggly lines, and reverse letters, and such, so when I read the local files I end up with those funny escape characters like \x85, \xa7, \x8d, etc.
My question is: is there any way I can convert all those escape characters into their respective UTF-8 characters? For example, if there is an 'à', how do I convert that into a standard 'a'?
Python calling code:
import os
word = 'apple'
os.system(r'wget.lnk --directory-prefix=G:/projects/words/dictionary/urls/ --output-document=G:\projects\words\dictionary\urls/' + word + '-dict.html http://dictionary.reference.com/browse/' + word)
I'm using wget-1.11.4-1 on a Windows 7 system (don't kill me Linux people, it was a client requirement), and the wget exe is being fired off with a Python 2.6 script file.
How do I convert all those escape characters into their respective characters? For example, if there is a Unicode à, how do I convert that into a standard a?
Assume you have loaded your unicode into a variable called my_unicode... normalizing à into a is this simple...
import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')
Explicit example...
>>> myfoo = u'àà'
>>> myfoo
u'\xe0\xe0'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
'aa'
>>>
How it works
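unicodedata.normalize('NFD', "insert-unicode-text-here") performs a Canonical Decomposition (NFD) of the unicode text, splitting each accented character into its base letter followed by combining marks; then we use .encode('ascii', 'ignore') to transform the NFD-mapped characters into ASCII. The 'ignore' error handler drops everything that is not ASCII, which includes the combining marks, so only the base letters remain.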
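To make the two steps visible, here is a minimal sketch. It is written for Python 3 (the answer above targets Python 2), and the helper name strip_accents is hypothetical, not something from the original answer.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize('NFD', text)
    # Encoding to ASCII with 'ignore' silently drops every non-ASCII
    # code point, including the combining accents that NFD split off.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# NFD turns one precomposed character into a base letter plus a combining mark.
decomposed_a = unicodedata.normalize('NFD', '\u00e0')
print([hex(ord(ch)) for ch in decomposed_a])

print(strip_accents('\u00e0\u00e0'))
```

Note that a character with no canonical decomposition is not transliterated by this approach; it is simply dropped by the 'ignore' handler.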
@Mike Pennington's solution works great, thanks to him. But when I tried it I noticed that it fails on some special characters (e.g. the ı character from the Turkish alphabet) that have no NFD decomposition.
I discovered another solution: you can use the unidecode library for this conversion.
>>> import unidecode
>>> example = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZabcçdefgğhıijklmnoöprsştuüvyz"
>>> # decode the UTF-8 byte string into a unicode object
>>> utf8text = unicode(example, "utf-8")
>>> print utf8text
ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZabcçdefgğhıijklmnoöprsştuüvyz
>>> # transliterate the unicode text to ASCII
>>> asciitext = unidecode.unidecode(utf8text)
>>> print asciitext
ABCCDEFGGHIIJKLMNOOPRSSTUUVYZabccdefgghiijklmnooprsstuuvyz
I needed something like this but to remove only accented characters, ignoring special ones and I did this small function:
# ~*~ coding: utf-8 ~*~
import re
def remove_accents(string):
    # Accept byte strings too: decode them as UTF-8 first.
    if type(string) is not unicode:
        string = unicode(string, encoding='utf-8')
    string = re.sub(u"[àáâãäå]", 'a', string)
    string = re.sub(u"[èéêë]", 'e', string)
    string = re.sub(u"[ìíîï]", 'i', string)
    string = re.sub(u"[òóôõö]", 'o', string)
    string = re.sub(u"[ùúûü]", 'u', string)
    string = re.sub(u"[ýÿ]", 'y', string)
    return string
I like that function because you can customize it in case you need to ignore other characters
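For what it's worth, the same idea can be sketched in Python 3 with a translation table instead of repeated re.sub calls. This swaps in str.translate, which is not what the function above uses; the names ACCENT_GROUPS and ACCENT_TABLE are hypothetical, and the mapping is only as complete as the groups you list.

```python
# Hypothetical mapping; extend or trim it to suit the characters you care about.
ACCENT_GROUPS = {
    'a': 'àáâãäå',
    'e': 'èéêë',
    'i': 'ìíîï',
    'o': 'òóôõö',
    'u': 'ùúûü',
    'y': 'ýÿ',
}

# Build a code point -> replacement table once, at import time.
ACCENT_TABLE = {
    ord(accented): plain
    for plain, group in ACCENT_GROUPS.items()
    for accented in group
}


def remove_accents(text):
    """Replace only the accented letters listed in ACCENT_GROUPS."""
    return text.translate(ACCENT_TABLE)


# 'ñ' is not in the mapping, so it is left untouched.
print(remove_accents('crème brûlée, señor'))
```

Characters outside the table pass through unchanged, which keeps the "ignore the special ones" behaviour of the original function.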
The given URL returns UTF-8 as the HTTP response clearly indicates:
wget -S http://dictionary.reference.com/browse/apple?s=t
--2013-01-02 08:43:40-- http://dictionary.reference.com/browse/apple?s=t
Resolving dictionary.reference.com (dictionary.reference.com)... 23.14.94.26, 23.14.94.11
Connecting to dictionary.reference.com (dictionary.reference.com)|23.14.94.26|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Server: Apache
Cache-Control: private
Content-Type: text/html;charset=UTF-8
Date: Wed, 02 Jan 2013 07:43:40 GMT
Transfer-Encoding: chunked
Connection: keep-alive
Connection: Transfer-Encoding
Set-Cookie: sid=UOPlLC7t-zl20-k7; Domain=reference.com; Expires=Wed, 02-Jan-2013 08:13:40 GMT; Path=/
Set-Cookie: cu.wz=0; Domain=.reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
Set-Cookie: recsrch=apple; Domain=reference.com; Expires=Tue, 02-Apr-2013 07:43:40 GMT; Path=/
Set-Cookie: dcc=*~*~*~*~*~*~*~*~; Domain=reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
Set-Cookie: iv_dic=1-0; Domain=reference.com; Expires=Thu, 03-Jan-2013 07:43:40 GMT; Path=/
Set-Cookie: accepting=1; Domain=.reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
Set-Cookie: bid=UOPlLC7t-zlrHXne; Domain=reference.com; Expires=Fri, 02-Jan-2015 07:43:40 GMT; Path=/
Length: unspecified [text/html]
Investigating the saved file using vim also reveals that the data is correctly utf-8 encoded...the same is true fetching the URL using Python.
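If the file on disk really is UTF-8, the \x.. escapes most likely come from looking at the raw bytes rather than the decoded text. Here is a minimal sketch of the difference; the sample text and the temporary file path are made up for illustration and only stand in for the downloaded page.

```python
import io
import os
import tempfile

# Stand-in for the downloaded page: UTF-8 bytes written to a temporary file.
sample = u'caf\u00e9 \u2026 \u00a7'
path = os.path.join(tempfile.mkdtemp(), 'apple-dict.html')
with io.open(path, 'wb') as handle:
    handle.write(sample.encode('utf-8'))

# Reading raw bytes shows the multi-byte sequences as \x.. escapes.
with io.open(path, 'rb') as handle:
    raw = handle.read()
print(repr(raw))

# Decoding with the charset the server announced yields real characters.
with io.open(path, 'r', encoding='utf-8') as handle:
    text = handle.read()
print(text == sample)
```

io.open with an explicit encoding behaves the same way on Python 2.6+ and Python 3, so this should also apply to the Python 2.6 setup in the question.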
The issue was different for me, but this Stack Overflow page helped me resolve it: unicodedata.normalize('NFKC', 'V').encode('ascii', 'ignore')
Output: b'V'
Related
I have built a REST interface. On '400 Bad Request' it returns a json body with specific information about the error.
(Pdb) error.code
400
Python correctly throws a URLError with these headers
(Pdb) print(error.headers)
Cache-Control: no-cache
Pragma: no-cache
Content-Type: application/json; charset=utf-8
Expires: -1
Server: Microsoft-IIS/7.5
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Date: Sat, 20 Aug 2016 13:01:05 GMT
Connection: close
Content-Length: 236
The Content-Length header says the body is 236 bytes long, but I cannot find a way to read the body.
I can see the extra information using DHC chrome plugin
{
"error_code": "00000001",
"error_message": "The json data is not in the correct json format.\r\nThe json data is not in the correct json format.\r\n'Execution Start Time' must not be empty.\r\n'Execution End Time' must not be empty.\r\n"
}
However, I cannot find a way in Python to read the body
Here are some of the things I have tried and what was returned.
(Pdb) len(error.read())
0
error.read().decode('utf-8', 'ignore')
''
(Pdb) error.readline()
b''
I found that this works the first time it is called, but does not work if called again.
error.read().decode('utf-8')
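That behaviour is consistent with the body being a stream that can only be consumed once, so the usual fix is to read it a single time and keep the result. Below is a minimal Python 3 sketch; the helper read_error_body, the URL, and the payload are hypothetical, and the HTTPError is constructed by hand only so the example runs without a server.

```python
import io
import json
from email.message import Message
from urllib.error import HTTPError


def read_error_body(error):
    """Read the response body once and return it decoded as JSON."""
    raw = error.read()  # consumes the underlying stream
    return json.loads(raw.decode('utf-8'))


# Simulated 400 response so the sketch runs without a real endpoint.
payload = b'{"error_code": "00000001", "error_message": "bad input"}'
error = HTTPError('http://example.invalid/api', 400, 'Bad Request',
                  Message(), io.BytesIO(payload))

details = read_error_body(error)
second_read = error.read()  # the stream is already exhausted
print(details['error_code'], second_read)
error.close()
```

In a debugger session, any earlier call such as len(error.read()) would already have consumed the body, which would explain the empty results shown above.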
I want to specify the HTTP response charset by modifying the Content-Type header. However, it doesn't work.
Here is a short example:
#coding=utf-8
import cherrypy
class Website:
    @cherrypy.expose()
    def index(self):
        cherrypy.response.headers['Content-Type']='text/plain; charset=gbk'
        return '。。。'.encode('gbk')
cherrypy.quickstart(Website(),'/',{
    '/': {
        'tools.response_headers.on':True,
    }
})
And when I visit that page, the Content-Type is changed mysteriously to text/plain;charset=utf-8, causing mojibake in the browser.
C:\Users\Administrator>ncat 127.0.0.1 8080 -C
GET / HTTP/1.1
Host: 127.0.0.1:8080
HTTP/1.1 200 OK
Server: CherryPy/7.1.0
Content-Length: 6
Content-Type: text/plain;charset=utf-8
Date: Mon, 22 Aug 2016 01:08:13 GMT
。。。^C
It seems that CherryPy detects the content encoding and overrides the charset automatically. If so, how can I disable this feature?
All right. I solved this problem by tampering with cherrypy.response.header_list directly.
#coding=utf-8
import cherrypy
def set_content_type():
    # Overwrite (or add) the Content-Type entry in the finalized header list.
    header=(b'Content-Type',cherrypy.response._content_type.encode())
    for ind,(key,_) in enumerate(cherrypy.response.header_list):
        if key.lower()==b'content-type':
            cherrypy.response.header_list[ind]=header
            break
    else:
        cherrypy.response.header_list.append(header)
cherrypy.tools.set_content_type=cherrypy.Tool('on_end_resource',set_content_type)
class Website:
    @cherrypy.expose()
    @cherrypy.tools.set_content_type()
    def index(self):
        cherrypy.response._content_type='text/plain; charset=gbk'
        return '。。。'.encode('gbk')
cherrypy.quickstart(Website(),'/')
I managed to set the Content-Type charset by manipulating the request header "Accept-Charset".
cherrypy.request.headers["Accept-Charset"] = "ISO-8859-1"
cherrypy.response.headers["Content-Type"] = "text/xml"
The result:
>curl -I https://127.0.0.1/url?param=value
HTTP/1.1 200 OK
Content-Type: text/xml;charset=ISO-8859-1
Server: CherryPy/18.6.0
Date: Mon, 10 Aug 2020 11:54:49 GMT
Content-Length: 288
Set-Cookie: session_id=d28fa46a1a3d901d9502038255ce869b21add5cc; expires=Mon, 10 Aug 2020 12:54:49 GMT; Max-Age=3600; Path=/
I would like to print ONLY the line which contains "Server" in the below piece of output:
Date: Sun, 16 Dec 2012 20:07:44 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=da8d52b67e5c7522:FF=0:TM=1355688464:LM=1355688464:S=CrK5vV-qb3UgWUM1; expires=Tue, 16-Dec-2014 20:07:44 GMT; path=/; domain=.google.com
Set-Cookie: NID=67=nICkwXDM6H7TNQfHbo06FbvZhO61bzNmtOn4HA71ukaVDSgywlBjBkAR-gXCpMNo1TlYym-eYMUlMkCHVpj7bDRwiHT6jkr7z4dMrApDuTk_HuTrZrkoctKlS7lXjz9a; expires=Mon, 17-Jun-2013 20:07:44 GMT; path=/; domain=.google.com; HttpOnly
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Connection: close
This information is fetched from a list called websiteheaders. I have the below piece of code which is driving me crazy that it is not working properly...
for line in websiteheaders:
    if "Server" in line:
        print line
Now the above piece of code prints exactly the same block of text that is shown at the beginning of my post. I just don't seem to get why it does that...
As I've said, I only want to print the line that contains "Server", if possible without regex, and with regex if that is not possible.
Please help and thanks!
EDIT: My complete code so far is pasted here: http://pastebin.com/sYuZyvX9
EDIT2: For completeness, in hosts.txt there currently is 1 host named "google.com"
Update
My code was actually working fine, but there was a mistake in another piece of my code, which meant that the data put into the list websiteheaders was one large string instead of multiple entries. In the above piece of code, it will of course find "Server" and print the whole entry, which in my case was the full (large) string.
Using
websiteheaders.extend(headers.splitlines())
instead of
websiteheaders.append(headers)
did the trick for me. Thanks a lot, guys.
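The difference between the two calls can be sketched like this (Python 3 syntax; the shortened header block is made up for illustration):

```python
headers = "Date: Sun, 16 Dec 2012 20:07:44 GMT\nServer: gws\nConnection: close"

appended = []
appended.append(headers)                # one element: the whole block

extended = []
extended.extend(headers.splitlines())   # one element per header line

# The same membership test behaves very differently on the two lists.
matches_appended = [line for line in appended if "Server" in line]
matches_extended = [line for line in extended if "Server" in line]

print(len(appended), len(extended))
print(matches_extended)
```

With append, the only entry is the full block, so the "Server" test matches it and the whole block is printed.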
Is websiteheaders really a list which is split for every line? Because if it's a string you should use:
for line in websiteheaders.splitlines():
    if "Server" in line:
        print line
Also, a good tip: I would recommend adding some print statements when you run into this kind of problem. If you had added something like:
    else:
        print 'WRONG LINE:', line
you probably would have caught that this loop was not looping over every line but over every character.
Update
I can't see what's wrong with your code then. This is what I get:
In [3]: websiteheaders
Out[3]:
['Date: Sun, 16 Dec 2012 20:07:44 GMT',
'Expires: -1',
'Cache-Control: private, max-age=0',
'Content-Type: text/html; charset=ISO-8859-1',
'Set-Cookie: PREF=ID=da8d52b67e5c7522:FF=0:TM=1355688464:LM=1355688464:S=CrK5vV-qb3UgWUM1; expires=Tue, 16-Dec-2014 20:07:44 GMT; path=/; domain=.google.com',
'Set-Cookie: NID=67=nICkwXDM6H7TNQfHbo06FbvZhO61bzNmtOn4HA71ukaVDSgywlBjBkAR-gXCpMNo1TlYym-eYMUlMkCHVpj7bDRwiHT6jkr7z4dMrApDuTk_HuTrZrkoctKlS7lXjz9a; expires=Mon, 17-Jun-2013 20:07:44 GMT; path=/; domain=.google.com; HttpOnly',
'P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."',
'Server: gws',
'X-XSS-Protection: 1; mode=block',
'X-Frame-Options: SAMEORIGIN',
'Connection: close"']
In [4]: for line in websiteheaders:
   ...:     if 'Server' in line:
   ...:         print line
   ...:
Server: gws
for single_line in websiteheaders.splitlines():
    if 'Server' in single_line:
        print single_line
I'm dealing with this python bug while writing my own reverse proxy. The server is sending my proxy this Set-Cookie response header:
workgroup_session_id=ilDJtR0rE1AG28C9ZxKLHj8TBtcT89sw; Path=/; Expires=Sun, 02-Dec-2012 5:57:25 GMT; HttpOnly
I am loading this string into a SimpleCookie instance from the Cookie module. Unfortunately, because of the bug that I referenced above, when I later pull expires out of the morsel dictionary it returns Sun,. I have found that I can overcome this bug by adding quotes around the Expires component of the Set-Cookie header (or adding quotes around any key / value pair that contains spaces in the value).
So this:
workgroup_session_id=ilDJtR0rE1AG28C9ZxKLHj8TBtcT89sw; Path=/; Expires=Sun, 02-Dec-2012 5:57:25 GMT; HttpOnly
Would become:
workgroup_session_id=ilDJtR0rE1AG28C9ZxKLHj8TBtcT89sw; Path=/; Expires="Sun, 02-Dec-2012 5:57:25 GMT"; HttpOnly
And this:
test=a b c; Path=/; Expires=a b c; HttpOnly
Would become:
test="a b c"; Path=/; Expires="a b c"; HttpOnly
I know that I could break the string into tokens and loop through them looking for spaces, then reconstruct the string, but I am curious what the best performing solution would be. As I mentioned, this is a reverse proxy that could potentially handle a few hundred requests a second, so I'd like this substitution to be as fast as possible.
Would a regular expression substitution (pre-compiled of course) be efficient? I've heard that regular expressions are pretty heavy....
How about this regex:
import re
header = re.sub("(?<==)[^;]* [^;]*", r'"\g<0>"', header)
This inserts quotes around whatever follows after a = until the next ; (or end of string), but only if there is at least one space in-between.
>>> header = 'test=a b c; Path=/; Expires=a b c; HttpOnly'
>>> re.sub("(?<==)[^;]* [^;]*", r'"\g<0>"', header)
'test="a b c"; Path=/; Expires="a b c"; HttpOnly'
>>> header = "workgroup_session_id=ilDJtR0rE1AG28C9ZxKLHj8TBtcT89sw; Path=/; Expires=Sun, 02-Dec-2012 5:57:25 GMT; HttpOnly"
>>> re.sub("(?<==)[^;]* [^;]*", r'"\g<0>"', header)
'workgroup_session_id=ilDJtR0rE1AG28C9ZxKLHj8TBtcT89sw; Path=/; Expires="Sun, 02-Dec-2012 5:57:25 GMT"; HttpOnly'
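Since the question mentions pre-compiling, here is a minimal sketch of the same substitution with the pattern compiled once at import time. The names SPACED_VALUE and quote_spaced_values are hypothetical; the pattern itself is the one from the answer above. No timing claims are made here, so measure it against your own traffic.

```python
import re

# Compiled once at import time; same pattern as in the answer above.
# (?<==)   : the match must start right after an '='
# [^;]* [^;]* : anything up to the next ';' that contains at least one space
SPACED_VALUE = re.compile(r'(?<==)[^;]* [^;]*')


def quote_spaced_values(header):
    """Wrap every '='-value that contains a space in double quotes."""
    return SPACED_VALUE.sub(r'"\g<0>"', header)


print(quote_spaced_values('test=a b c; Path=/; Expires=a b c; HttpOnly'))
```

Values without spaces, such as Path=/, are left alone because the pattern requires at least one space before the next semicolon.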
Do you need to put quotes just around the date following Expires, or any arbitrary date that appears anywhere in the header? If it's the former, try this:
header = "workgroup_session_id=ilDJtR0rE1AG28C9ZxKLHj8TBtcT89sw; Path=/; Expires=Sun, 02-Dec-2012 5:57:25 GMT; HttpOnly"
print(header.replace('Expires=', 'Expires="').replace('GMT', 'GMT"'))
I am calling the URL :
http://code.google.com/feeds/issues/p/chromium/issues/full/291?alt=json
using urllib2 and decoding using the json module
import json
import urllib2
url = "http://code.google.com/feeds/issues/p/chromium/issues/full/291?alt=json"
request = urllib2.Request(url)  # the original passed an undefined name, query
response = urllib2.urlopen(request)
issue_report = json.loads(response.read())
I run into the following error :
ValueError: Invalid control character at: line 1 column 1120 (char 1120)
I tried checking the header and I got the following :
Content-Type: application/json; charset=UTF-8
Access-Control-Allow-Origin: *
Expires: Sun, 03 Jul 2011 17:38:38 GMT
Date: Sun, 03 Jul 2011 17:38:38 GMT
Cache-Control: private, max-age=0, must-revalidate, no-transform
Vary: Accept, X-GData-Authorization, GData-Version
GData-Version: 1.0
ETag: W/"CUEGQX47eCl7ImA9WxJaFEw."
Last-Modified: Tue, 04 Aug 2009 19:20:20 GMT
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Server: GSE
Connection: close
I also tried adding an encoding parameter as follows :
issue_report = json.loads(response.read() , encoding = 'UTF-8')
I still run into the same error.
The feed has raw data from a JPEG in it at that point; the JSON is malformed, so it's not your fault. Report a bug to Google.
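If you want to see for yourself what sits at the reported position, a small Python 3 sketch like the one below can help. The payload is a made-up stand-in for the feed, not the real response. The strict=False option of json.loads only tells the parser to tolerate raw control characters inside strings; whether the result is meaningful for a feed with embedded binary data is a separate question, so treat this as a diagnostic aid rather than a fix.

```python
import json

# Hypothetical payload with a raw control character inside a string value.
raw = '{"title": "bad\x01value"}'

try:
    json.loads(raw)
    error_position = None
except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
    error_position = getattr(exc, 'pos', None)
    if error_position is not None:
        # Show a small window of text around the reported position.
        start = max(0, error_position - 10)
        print(repr(raw[start:error_position + 10]))

# strict=False allows control characters inside JSON strings.
lenient = json.loads(raw, strict=False)
print(repr(lenient['title']))
```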
You could consider using lxml instead, since the JSON is malformed. Its XPath support makes working with XML pretty straightforward:
import lxml.etree
url = 'http://code.google.com/feeds/issues/p/chromium/issues/full/291'
doc = lxml.etree.parse(url)
ns = {'issues': 'http://schemas.google.com/projecthosting/issues/2009'}
issues = doc.xpath('//issues:*', namespaces=ns)
It is fairly easy to manipulate the elements, for instance to strip the namespace from the tags and convert them to a dict:
>>> dict((x.tag[len(ns['issues'])+2:], x.text) for x in issues)
<<<
{'closedDate': '2009-08-04T19:20:20.000Z',
'id': '291',
'label': 'Area-BrowserUI',
'stars': '13',
'state': 'closed',
'status': 'Verified'}