Regex on unicode string - python

I am trying to download a few hundred Korean pages like this one:
http://homeplusexpress.com/store/store_view.asp?cd_express=3
For each page, I want to use a regex to extract the "address" field, which in the above page looks like:
*주소 : 서울시 광진구 구의1동 236-53
So I do this:
>>> import requests
>>> resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
>>> resp.encoding
'ISO-8859-1'
>>> # I wonder why it's ISO-8859-1, since I thought that is for Latin text (Latin-1).
>>> html = resp.text
>>> type(html)
<type 'unicode'>
>>> html
(outputs a long string that contains a lot of characters like \xc3\xb7\xaf\xbd\xba \xc0\xcd\xbd\xba\xc7\xc1\xb7\xb9)
I then wrote a script. I set # -*- coding: utf-8 -*- at the top of the .py file and put this:
address = re.search('주소', html)
However, re.search is returning None. I tried with and without the u prefix on the regex string.
Usually I can solve issues like this with a call to .encode or .decode but I tried a few things and am stuck. Any pointers on what I'm missing?

According to the meta tag in the HTML document's head:
<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
the web page uses the euc-kr encoding.
I wrote this code:
# -*- coding: euc-kr -*-
import re
import requests
resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
html = resp.text
address = re.search('주소', html)
print address
Then I saved the file in gedit using the euc-kr encoding, and I got a match. (This works somewhat by accident: requests falls back to ISO-8859-1, so every byte of the euc-kr page becomes the unicode code point with the same value, and the euc-kr bytes of the '주소' literal in the euc-kr-encoded source file line up with those code points one-to-one.)
But actually there is an even better solution! You can keep the utf-8 encoding for your files.
# -*- coding: utf-8 -*-
import re
import requests
resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
resp.encoding = 'euc-kr'
# we need to specify what the encoding is because the
# requests library couldn't detect it correctly
html = resp.text
# now the html variable contains a properly decoded unicode instance
print type(html)
# we use the re.search functions with unicode strings
address = re.search(u'주소', html)
print address
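Since the goal is to extract the address text itself, not just find the label, you can extend the pattern with a capture group. A minimal sketch; the assumption that the address sits on the same line after a colon is mine, based on the sample line in the question:
# -*- coding: utf-8 -*-
import re
import requests

resp = requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
resp.encoding = 'euc-kr'
html = resp.text
# capture everything after the label and colon up to the end of the
# line or the next tag; the exact page layout is an assumption here
match = re.search(ur'주소\s*:\s*([^\r\n<]+)', html)
if match:
    print match.group(1).strip()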

From the requests documentation: "When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers."
If you check the response from this website, you can see that the server sends no charset in its Content-Type header.
I think the only option in this case is to specify directly what encoding to use:
# -*- coding: utf-8 -*-
import requests
import re
r = requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
r.encoding = 'euc-kr'
print re.search(ur'주소', r.text, re.UNICODE)
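To see for yourself why requests picked ISO-8859-1 (it is the HTTP default for text/* responses when the server sends no charset), you can inspect the response before overriding the encoding; apparent_encoding is requests' detector-based guess from the body:
import requests

r = requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
print r.headers.get('content-type')  # no charset parameter in the header
print r.encoding                     # ISO-8859-1, the HTTP default fallback
print r.apparent_encoding            # what the character detector guesses from the body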

Related

Why can't I convert unicode string to plain python string?

url = u'/wiki/Category:%E6%89%93%E7%A3%9A%E5%A1%8A'
The decoded string is (through https://www.urldecoder.org/):
decoded_url = u'/wiki/Category:打磚塊'
In python, I have the following code to do this conversion:
decoded_url = url.decode('utf-8')
This code doesn't change it at all. I also tried:
decoded_url = url.encode('utf-8')
The string remains the same. How to convert it to the decoded string I want?
Here's Python 2.7 code that gives you the result you want from the original string in your question:
import urlparse

utf_str = u"/wiki/Category:%E6%89%93%E7%A3%9A%E5%A1%8A"
ascii_str = utf_str.encode('ascii')   # the percent-encoded form is pure ASCII
result = urlparse.unquote(ascii_str)  # unquote the byte string, not the unicode one
print(result)
Result:
/wiki/Category:打磚塊
It appears that unquote does the wrong thing when given a unicode string. You have to first convert it to single-byte string before unquote will do the right thing.
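You can watch the wrong thing happen directly in a Python 2.7 REPL: with unicode input, unquote maps each %XX byte to the Latin-1 character of the same value, producing mojibake instead of the intended UTF-8 text:
>>> import urlparse
>>> urlparse.unquote(u'%E6%89%93')                 # unicode in: mojibake out
u'\xe6\x89\x93'
>>> urlparse.unquote('%E6%89%93').decode('utf-8')  # bytes in, then decode: correct
u'\u6253'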
It is not UTF-8 encoding but URL escaping (also called URL quoting or percent-encoding).
import urllib.parse
print( urllib.parse.unquote( u'/wiki/Category:%E6%89%93%E7%A3%9A%E5%A1%8A') )
Result
/wiki/Category:打磚塊
Python 3.x doc: urllib.parse
EDIT:
Python 2.7 has it in module urlparse
import urlparse
print( urlparse.unquote(u'/wiki/Category:%E6%89%93%E7%A3%9A%E5%A1%8A') )
Python 2.7 doc: urlparse
EDIT:
After testing with Python 2.7: it needs encode() before unquote(), so that unquote() operates on a str (plain bytes) instead of unicode.
#-*- coding: utf-8 -*-
import urlparse
url = u'/wiki/Category:%e6%89%93%E7%A3%9A%E5%A1%8A'
url = url.encode('utf-8') # convert `unicode` to `str`
url = urlparse.unquote(url) # convert `%e6%89%93%E7%A3%9A%E5%A1%8A` to `打磚塊`
print url
print type(url)
print '打磚塊' in url
Result
/wiki/Category:打磚塊
<type 'str'>
True
BTW: the same works in Python 3, except it doesn't need the encode() step:
import urllib.parse
url = u'/wiki/Category:%e6%89%93%E7%A3%9A%E5%A1%8A'
url = urllib.parse.unquote(url) # convert `%e6%89%93%E7%A3%9A%E5%A1%8A` to `打磚塊`
print(url)
print(type(url))
print('打磚塊' in url)
Result:
/wiki/Category:打磚塊
<class 'str'>
True

hi § symbol unrecognized

Good morning.
I'm trying to do this, but it won't let me.
Can you help me?
Thank you very much.
soup = BeautifulSoup(html_page)
titulo=soup.find('h3').get_text()
titulo=titulo.replace('§','')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Define the coding and operate with unicode strings:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html_page = u"<h3>§ title here</h3>"
soup = BeautifulSoup(html_page, "html.parser")
titulo = soup.find('h3').get_text()
titulo = titulo.replace(u'§', '')
print(titulo)
Prints title here.
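The error in the question comes from mixing types: get_text() returns unicode, while '§' in a utf-8 source file is the two-byte str '\xc2\xa7', so Python 2 tries to decode it as ASCII inside replace() and fails on byte 0xc2. A minimal reproduction of both the fix and the failure:
# -*- coding: utf-8 -*-
titulo = u'§ title here'        # what get_text() returns: a unicode string
print titulo.replace(u'§', '')  # unicode pattern: works
print titulo.replace('§', '')   # str pattern '\xc2\xa7' is implicitly decoded as
                                # ascii -> UnicodeDecodeError: ... byte 0xc2 in position 0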
Let me explain clearly what the problem is:
By default, Python 2 does not accept non-ASCII characters such as "à" or "ò" in source code. To make Python accept them, put this at the top of your script:
# -*- coding: utf-8 -*-
This declaration tells Python which encoding to use when decoding characters that are not recognized by default.
Another way to change the default encoding is via the sys module:
import sys
reload(sys)  # reload() restores sys.setdefaultencoding, which is removed at startup
sys.setdefaultencoding('UTF8')  # here you choose the default encoding

How to use pycurl when url contain non-English language?

This is the example from pycurl's SourceForge page. What if the URL contains something like Chinese characters? What should we do, given that pycurl does not support unicode?
import pycurl
c = pycurl.Curl()
c.setopt(pycurl.URL, "http://www.python.org/")
c.setopt(pycurl.HTTPHEADER, ["Accept:"])
import StringIO
b = StringIO.StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.perform()
print b.getvalue()
Here's a script that demonstrates three separate issues:
non-ascii characters in Python source code
non-ascii characters in the url
non-ascii characters in the html content
# -*- coding: utf-8 -*-
import urllib
from StringIO import StringIO
import pycurl
title = u"UNIX时间" # 1
url = "https://zh.wikipedia.org/wiki/" + urllib.quote(title.encode('utf-8')) # 2
c = pycurl.Curl()
c.setopt(pycurl.URL, url)
c.setopt(pycurl.HTTPHEADER, ["Accept:"])
b = StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.perform()
data = b.getvalue() # bytes
print len(data), repr(data[:200])
html_page_charset = "utf-8" # 3
html_text = data.decode(html_page_charset)
print html_text[:200] # 4
Note: all the utf-8 occurrences in the code are completely independent from each other.
1. Unicode literals use whatever character encoding you defined at the top of the file; make sure your text editor respects that setting.
2. The path in the url should be encoded using utf-8 before it is percent-encoded (urlencoded).
3. There are several ways to find out an html page's charset; see Character encodings in HTML. Some libraries, such as requests mentioned by @Oz123, do it automatically:
# -*- coding: utf-8 -*-
import requests
r = requests.get(u"https://zh.wikipedia.org/wiki/UNIX时间")
print len(r.content), repr(r.content[:200]) # bytes
print r.encoding
print r.text[:200] # Unicode
4. To print Unicode to the console you could use the PYTHONIOENCODING environment variable to set a character encoding that your terminal understands.
See also The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and Python-specific Pragmatic Unicode.
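For example, to see which encoding Python detected for your console (and override it via the environment variable if it is wrong):
import sys
print sys.stdout.encoding  # run as: PYTHONIOENCODING=utf-8 python script.py to override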
Try urllib.quote, which will replace non-ASCII characters by an escape sequence:
import urllib
url_to_fetch = urllib.quote(unicode_url)
edit: only the path should be quoted; you will have to split the complete URL with urlparse, quote the path, and then use urlunparse to obtain the final URL to fetch.
Just encode your url as "utf-8" and everything will be fine. From the docs [1]:
Under Python 3, the bytes type holds arbitrary encoded byte strings. PycURL will accept bytes values for all options where libcurl specifies a “string” argument:
>>> import pycurl
>>> c = pycurl.Curl()
>>> c.setopt(c.USERAGENT, b'Foo\xa9')
# ok
The str type holds Unicode data. PycURL will accept str values containing ASCII code points only:
>>> c.setopt(c.USERAGENT, 'Foo')
# ok
>>> c.setopt(c.USERAGENT, 'Foo\xa9')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xa9' in position 3:
ordinal not in range(128)
>>> c.setopt(c.USERAGENT, 'Foo\xa9'.encode('iso-8859-1'))
# ok
[1] http://pycurl.io/docs/latest/unicode.html

Python urllib2() functions with international/UTF-8 characters

For a personal research/fun project I am using the Python urllib2 module. However, when I have a link with non-ASCII chars, say, "الراجل اللى ورا عمر سليمان" or "我爸是李刚", then the interpreter (IDLE on Windows 7) runs into problems.
s = urllib2.urlopen("http://www.bing.com/search?q=我爸是李刚")
How should I go about rectifying this? (Should I convert my query into ASCII or is there a way to have urllib2 work with UTF-8 another way?)
import urllib
import urllib2

s = urllib2.urlopen("http://www.bing.com/search?"
                    + urllib.urlencode({'q': u'我爸是李刚'.encode('utf8')}))
Should work.
# coding: utf-8
import urllib
import urlparse
scheme = 'http'
netloc = 'www.bing.com'
path = '/search'
qs = {'q': u'我爸是李刚'.encode('utf-8')}
print urlparse.urlunparse((scheme, netloc, path, '', urllib.urlencode(qs), ''))
# http://www.bing.com/search?q=%E6%88%91%E7%88%B8%E6%98%AF%E6%9D%8E%E5%88%9A

How to fetch a non-ascii url with urlopen?

I need to fetch data from a URL with non-ascii characters but urllib2.urlopen refuses to open the resource and raises:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0131' in position 26: ordinal not in range(128)
I know the URL is not standards compliant but I have no chance to change it.
What is the way to access a resource pointed by a URL containing non-ascii characters using Python?
edit: In other words, can urlopen open a URL like the following, and if so, how?
http://example.org/Ñöñ-ÅŞÇİİ/
Strictly speaking URIs can't contain non-ASCII characters; what you have there is an IRI.
To convert an IRI to a plain ASCII URI:
non-ASCII characters in the hostname part of the address have to be encoded using the Punycode-based IDNA algorithm;
non-ASCII characters in the path, and most of the other parts of the address have to be encoded using UTF-8 and %-encoding, as per Ignacio's answer.
So:
import re, urlparse

def urlEncodeNonAscii(b):
    return re.sub('[\x80-\xFF]', lambda c: '%%%02x' % ord(c.group(0)), b)

def iriToUri(iri):
    parts = urlparse.urlparse(iri)
    return urlparse.urlunparse(
        part.encode('idna') if parti == 1 else urlEncodeNonAscii(part.encode('utf-8'))
        for parti, part in enumerate(parts)
    )
>>> iriToUri(u'http://www.a\u0131b.com/a\u0131b')
'http://www.xn--ab-hpa.com/a%c4%b1b'
(Technically this still isn't quite good enough in the general case because urlparse doesn't split away any user:pass@ prefix or :port suffix on the hostname. Only the hostname part should be IDNA encoded. It's easier to encode using normal urllib.quote and .encode('idna') at the time you're constructing a URL than to have to pull an IRI apart.)
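Following that suggestion, here is a minimal sketch (Python 2; the hostname and path are made-up examples) of encoding the pieces at construction time instead of pulling an IRI apart:
# -*- coding: utf-8 -*-
import urllib

host = u'www.a\u0131b.com'.encode('idna')          # IDNA for the hostname part
path = urllib.quote(u'/a\u0131b'.encode('utf-8'))  # UTF-8 + percent-escaping for the path
print 'http://%s%s' % (host, path)
# http://www.xn--ab-hpa.com/a%C4%B1b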
In python3, use the urllib.parse.quote function on the non-ascii string:
>>> from urllib.request import urlopen
>>> from urllib.parse import quote
>>> chinese_wikipedia = 'http://zh.wikipedia.org/wiki/Wikipedia:' + quote('首页')
>>> urlopen(chinese_wikipedia)
Python 3 has libraries to handle this situation. Use
urllib.parse.urlsplit to split the URL into its components, and
urllib.parse.quote to properly quote/escape the unicode characters
and urllib.parse.urlunsplit to join it back together.
>>> import urllib.parse
>>> url = 'http://example.com/unicodè'
>>> url = urllib.parse.urlsplit(url)
>>> url = list(url)
>>> url[2] = urllib.parse.quote(url[2])
>>> url = urllib.parse.urlunsplit(url)
>>> print(url)
http://example.com/unicod%C3%A8
It is more complex than the accepted answer from @bobince suggests:
netloc should be encoded using IDNA;
non-ascii URL path should be encoded to UTF-8 and then percent-escaped;
non-ascii query parameters should be encoded to the encoding of the page the URL was extracted from (or to the encoding the server uses), then percent-escaped.
This is how all browsers work; it is specified in https://url.spec.whatwg.org/ - see this example. A Python implementation can be found in w3lib (this is the library Scrapy is using); see w3lib.url.safe_url_string:
from w3lib.url import safe_url_string
url = safe_url_string(u'http://example.org/Ñöñ-ÅŞÇİİ/', encoding="<page encoding>")
An easy way to check whether a URL-escaping implementation is incorrect or incomplete is to check whether it provides a 'page encoding' argument.
Based on @darkfeline's answer:
from urllib.parse import urlsplit, urlunsplit, quote

def iri2uri(iri):
    """
    Convert an IRI to a URI (Python 3).
    """
    uri = ''
    if isinstance(iri, str):
        (scheme, netloc, path, query, fragment) = urlsplit(iri)
        scheme = quote(scheme)
        netloc = netloc.encode('idna').decode('utf-8')
        path = quote(path)
        query = quote(query)
        fragment = quote(fragment)
        uri = urlunsplit((scheme, netloc, path, query, fragment))
    return uri
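A quick check with the same IRI used earlier in this thread:
>>> iri2uri('http://www.a\u0131b.com/a\u0131b')
'http://www.xn--ab-hpa.com/a%C4%B1b'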
For those not depending strictly on urllib, one practical alternative is requests, which handles IRIs "out of the box".
For example, with http://bücher.ch:
>>> import requests
>>> r = requests.get(u'http://b\u00DCcher.ch')
>>> r.status_code
200
Encode the unicode to UTF-8, then URL-encode.
Use the iri2uri method of httplib2. It does the same thing as @bobince's answer (is he/she the author of that?).
Another option to convert an IRI to an ASCII URI is to use furl package:
gruns/furl: 🌐 URL parsing and manipulation made easy. - https://github.com/gruns/furl
Python's standard urllib and urlparse modules provide a number of URL
related functions, but using these functions to perform common URL
operations proves tedious. Furl makes parsing and manipulating URLs
easy.
Examples
Non-ASCII domain
http://国立極地研究所.jp/english/ (Japanese National Institute of Polar Research website)
import furl
url = 'http://国立極地研究所.jp/english/'
furl.furl(url).tostr()
'http://xn--vcsoey76a2hh0vtuid5qa.jp/english/'
Non-ASCII path
https://ja.wikipedia.org/wiki/日本語 ("Japanese" article in Wikipedia)
import furl
url = 'https://ja.wikipedia.org/wiki/日本語'
furl.furl(url).tostr()
'https://ja.wikipedia.org/wiki/%E6%97%A5%E6%9C%AC%E8%AA%9E'
It works, finally!
I could not avoid these strange characters, but in the end I got through it.
import urllib.request
import os

url = "http://www.fourtourismblog.it/le-nuove-tendenze-del-marketing-tenere-docchio/"
with urllib.request.urlopen(url) as file:
    html = file.read()
with open("marketingturismo.html", "w", encoding='utf-8') as file:
    file.write(html.decode('utf-8'))
os.system("marketingturismo.html")
