If I do
url = "http://example.com?p=" + urllib.quote(query)
It doesn't encode / to %2F (breaks OAuth normalization)
It doesn't handle Unicode (it throws an exception)
Is there a better library?
Python 2
From the documentation:
urllib.quote(string[, safe])
Replace special characters in string
using the %xx escape. Letters, digits,
and the characters '_.-' are never
quoted. By default, this function is
intended for quoting the path section
of the URL.The optional safe parameter
specifies additional characters that
should not be quoted — its default
value is '/'
That means passing '' for safe will solve your first issue:
>>> urllib.quote('/test')
'/test'
>>> urllib.quote('/test', safe='')
'%2Ftest'
About the second issue, there is a bug report about it. Apparently it was fixed in Python 3. You can workaround it by encoding as UTF-8 like this:
>>> query = urllib.quote(u"Müller".encode('utf8'))
>>> print urllib.unquote(query).decode('utf8')
Müller
By the way, have a look at urlencode.
Python 3
In Python 3, the function quote has been moved to urllib.parse:
>>> import urllib.parse
>>> print(urllib.parse.quote("Müller".encode('utf8')))
M%C3%BCller
>>> print(urllib.parse.unquote("M%C3%BCller"))
Müller
In Python 3, urllib.quote has been moved to urllib.parse.quote, and it does handle Unicode by default.
>>> from urllib.parse import quote
>>> quote('/test')
'/test'
>>> quote('/test', safe='')
'%2Ftest'
>>> quote('/El Niño/')
'/El%20Ni%C3%B1o/'
I think module requests is much better. It's based on urllib3.
You can try this:
>>> from requests.utils import quote
>>> quote('/test')
'/test'
>>> quote('/test', safe='')
'%2Ftest'
My answer is similar to Paolo's answer.
If you're using Django, you can use urlquote:
>>> from django.utils.http import urlquote
>>> urlquote(u"Müller")
u'M%C3%BCller'
Note that changes to Python mean that this is now a legacy wrapper. From the Django 2.1 source code for django.utils.http:
A legacy compatibility wrapper to Python's urllib.parse.quote() function.
(was used for unicode handling on Python 2)
It is better to use urlencode here. There isn't much difference for a single parameter, but, IMHO, it makes the code clearer. (It looks confusing to see a function quote_plus! - especially those coming from other languages.)
In [21]: query='lskdfj/sdfkjdf/ksdfj skfj'
In [22]: val=34
In [23]: from urllib.parse import urlencode
In [24]: encoded = urlencode(dict(p=query,val=val))
In [25]: print(f"http://example.com?{encoded}")
http://example.com?p=lskdfj%2Fsdfkjdf%2Fksdfj+skfj&val=34
Documentation
urlencode
quote_plus
An alternative method using furl:
import furl
url = "https://httpbin.org/get?hello,world"
print(url)
url = furl.furl(url).url
print(url)
Output:
https://httpbin.org/get?hello,world
https://httpbin.org/get?hello%2Cworld
Related
If I do
url = "http://example.com?p=" + urllib.quote(query)
It doesn't encode / to %2F (breaks OAuth normalization)
It doesn't handle Unicode (it throws an exception)
Is there a better library?
Python 2
From the documentation:
urllib.quote(string[, safe])
Replace special characters in string
using the %xx escape. Letters, digits,
and the characters '_.-' are never
quoted. By default, this function is
intended for quoting the path section
of the URL.The optional safe parameter
specifies additional characters that
should not be quoted — its default
value is '/'
That means passing '' for safe will solve your first issue:
>>> urllib.quote('/test')
'/test'
>>> urllib.quote('/test', safe='')
'%2Ftest'
About the second issue, there is a bug report about it. Apparently it was fixed in Python 3. You can workaround it by encoding as UTF-8 like this:
>>> query = urllib.quote(u"Müller".encode('utf8'))
>>> print urllib.unquote(query).decode('utf8')
Müller
By the way, have a look at urlencode.
Python 3
In Python 3, the function quote has been moved to urllib.parse:
>>> import urllib.parse
>>> print(urllib.parse.quote("Müller".encode('utf8')))
M%C3%BCller
>>> print(urllib.parse.unquote("M%C3%BCller"))
Müller
In Python 3, urllib.quote has been moved to urllib.parse.quote, and it does handle Unicode by default.
>>> from urllib.parse import quote
>>> quote('/test')
'/test'
>>> quote('/test', safe='')
'%2Ftest'
>>> quote('/El Niño/')
'/El%20Ni%C3%B1o/'
I think module requests is much better. It's based on urllib3.
You can try this:
>>> from requests.utils import quote
>>> quote('/test')
'/test'
>>> quote('/test', safe='')
'%2Ftest'
My answer is similar to Paolo's answer.
If you're using Django, you can use urlquote:
>>> from django.utils.http import urlquote
>>> urlquote(u"Müller")
u'M%C3%BCller'
Note that changes to Python mean that this is now a legacy wrapper. From the Django 2.1 source code for django.utils.http:
A legacy compatibility wrapper to Python's urllib.parse.quote() function.
(was used for unicode handling on Python 2)
It is better to use urlencode here. There isn't much difference for a single parameter, but, IMHO, it makes the code clearer. (It looks confusing to see a function quote_plus! - especially those coming from other languages.)
In [21]: query='lskdfj/sdfkjdf/ksdfj skfj'
In [22]: val=34
In [23]: from urllib.parse import urlencode
In [24]: encoded = urlencode(dict(p=query,val=val))
In [25]: print(f"http://example.com?{encoded}")
http://example.com?p=lskdfj%2Fsdfkjdf%2Fksdfj+skfj&val=34
Documentation
urlencode
quote_plus
An alternative method using furl:
import furl
url = "https://httpbin.org/get?hello,world"
print(url)
url = furl.furl(url).url
print(url)
Output:
https://httpbin.org/get?hello,world
https://httpbin.org/get?hello%2Cworld
I am using Readability Parser API to extract content from a web page. It is ok when the web page is in Latin character set, but when I extract article in Cyrillic, it ends up with the following:
<div>Ввоскресень</div>...etc
The interesting thing here is that the title of a web page is extracted correctly in Cyrillic, but not the content. My attempt was to do the following as it suggested in this SO answer:
content = unicodedata.normalize('NFKD', content).encode('ascii','ignore')
but it did not work. Could you tell me if there is a way to convert this string before saving to database?
Please let me know if the title of my question explains correctly what I need. Thank you.
One way (Python 3.3):
>>> s='<div>Ввоскресень</div>'
>>> import html.parser
>>> h=html.parser.HTMLParser()
>>> h.unescape(s)
'<div>Ввоскресень</div>'
Python 2.7:
>>> s='<div>Ввоскресень</div>'
>>> import HTMLParser
>>> h=HTMLParser.HTMLParser()
>>> print(h.unescape(s))
<div>Ввоскресень</div>
P.S. I went to look for the documentation link and it looks like unescape isn't documented. Here's a way without using an undocumented API:
>>> re.sub(r'&#x(.*?);',lambda x: chr(int(x.group(1),16)),s)
'<div>Ввоскресень</div>'
Per comment it looks finally documented (and moved) in Python 3.4:
https://docs.python.org/3.4/library/html.html#html.unescape
My string is Niệm Bồ Tát (Thiá»n sÆ° Nhất Hạnh) and I want to decode it to Niệm Bồ Tát (Thiền sư Nhất Hạnh). I see in that site can do that http://www.enderminh.com/minh/utf8-to-unicode-converter.aspx
and I start to try by Python
mystr = '09. Bát Nhã Tâm Kinh'
mystr.decode('utf-8')
but actually it is not correct because original string is utf-8 but the string show is not my expecting result.
Note: it is Vietnamese character.
How to resolve that case? Is that Windows Unicode or something? How to detect the encoding here.
The only thing that helped me with broken cyrillic string - https://github.com/LuminosoInsight/python-ftfy
This module fixes pretty much everything and works much better than online decoders.
>>> from ftfy import fix_encoding
>>> mystr = '09. Bát Nhã Tâm Kinh'
>>> fix_encoding(mystr)
'09. Bát Nhã Tâm Kinh'
It can be easily installed using pip install ftfy
I'm not sure what you can do with these kind of data, but for your example in your original post, this works (Python 3.x):
>>> mystr = '09. Bát Nhã Tâm Kinh'
>>> s = mystr.encode('latin1').decode('utf8')
>>> s
'09. Bát Nhã Tâm Kinh'
>>> print(s)
09. Bát Nhã Tâm Kinh
Try:
str.encode('ascii', 'ignore').decode('utf-8')
You're encoding the string in ASCII format / ignoring the errors and decoding in UTF-8. This may remove the accents, but it's one approach.
The correct method in python 3.9.6 is:
"string".encode('utf-8').decode('latin-1')
"string".encode('latin1').decode('utf8')
If I do
url = "http://example.com?p=" + urllib.quote(query)
It doesn't encode / to %2F (breaks OAuth normalization)
It doesn't handle Unicode (it throws an exception)
Is there a better library?
Python 2
From the documentation:
urllib.quote(string[, safe])
Replace special characters in string
using the %xx escape. Letters, digits,
and the characters '_.-' are never
quoted. By default, this function is
intended for quoting the path section
of the URL.The optional safe parameter
specifies additional characters that
should not be quoted — its default
value is '/'
That means passing '' for safe will solve your first issue:
>>> urllib.quote('/test')
'/test'
>>> urllib.quote('/test', safe='')
'%2Ftest'
About the second issue, there is a bug report about it. Apparently it was fixed in Python 3. You can workaround it by encoding as UTF-8 like this:
>>> query = urllib.quote(u"Müller".encode('utf8'))
>>> print urllib.unquote(query).decode('utf8')
Müller
By the way, have a look at urlencode.
Python 3
In Python 3, the function quote has been moved to urllib.parse:
>>> import urllib.parse
>>> print(urllib.parse.quote("Müller".encode('utf8')))
M%C3%BCller
>>> print(urllib.parse.unquote("M%C3%BCller"))
Müller
In Python 3, urllib.quote has been moved to urllib.parse.quote, and it does handle Unicode by default.
>>> from urllib.parse import quote
>>> quote('/test')
'/test'
>>> quote('/test', safe='')
'%2Ftest'
>>> quote('/El Niño/')
'/El%20Ni%C3%B1o/'
I think module requests is much better. It's based on urllib3.
You can try this:
>>> from requests.utils import quote
>>> quote('/test')
'/test'
>>> quote('/test', safe='')
'%2Ftest'
My answer is similar to Paolo's answer.
If you're using Django, you can use urlquote:
>>> from django.utils.http import urlquote
>>> urlquote(u"Müller")
u'M%C3%BCller'
Note that changes to Python mean that this is now a legacy wrapper. From the Django 2.1 source code for django.utils.http:
A legacy compatibility wrapper to Python's urllib.parse.quote() function.
(was used for unicode handling on Python 2)
It is better to use urlencode here. There isn't much difference for a single parameter, but, IMHO, it makes the code clearer. (It looks confusing to see a function quote_plus! - especially those coming from other languages.)
In [21]: query='lskdfj/sdfkjdf/ksdfj skfj'
In [22]: val=34
In [23]: from urllib.parse import urlencode
In [24]: encoded = urlencode(dict(p=query,val=val))
In [25]: print(f"http://example.com?{encoded}")
http://example.com?p=lskdfj%2Fsdfkjdf%2Fksdfj+skfj&val=34
Documentation
urlencode
quote_plus
An alternative method using furl:
import furl
url = "https://httpbin.org/get?hello,world"
print(url)
url = furl.furl(url).url
print(url)
Output:
https://httpbin.org/get?hello,world
https://httpbin.org/get?hello%2Cworld
I'd like to know do I normalize a URL in python.
For example, If I have a url string like : "http://www.example.com/foo goo/bar.html"
I need a library in python that will transform the extra space (or any other non normalized character) to a proper URL.
Have a look at this module: werkzeug.utils. (now in werkzeug.urls)
The function you are looking for is called "url_fix" and works like this:
>>> from werkzeug.urls import url_fix
>>> url_fix(u'http://de.wikipedia.org/wiki/Elf (Begriffsklärung)')
'http://de.wikipedia.org/wiki/Elf%20%28Begriffskl%C3%A4rung%29'
It's implemented in Werkzeug as follows:
import urllib
import urlparse
def url_fix(s, charset='utf-8'):
"""Sometimes you get an URL by a user that just isn't a real
URL because it contains unsafe characters like ' ' and so on. This
function can fix some of the problems in a similar way browsers
handle data entered by the user:
>>> url_fix(u'http://de.wikipedia.org/wiki/Elf (Begriffsklärung)')
'http://de.wikipedia.org/wiki/Elf%20%28Begriffskl%C3%A4rung%29'
:param charset: The target charset for the URL if the url was
given as unicode string.
"""
if isinstance(s, unicode):
s = s.encode(charset, 'ignore')
scheme, netloc, path, qs, anchor = urlparse.urlsplit(s)
path = urllib.quote(path, '/%')
qs = urllib.quote_plus(qs, ':&=')
return urlparse.urlunsplit((scheme, netloc, path, qs, anchor))
Real fix in Python 2.7 for that problem
Right solution was:
# percent encode url, fixing lame server errors for e.g, like space
# within url paths.
fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'#()*[]")
For more information see Issue918368: "urllib doesn't correct server returned urls"
use urllib.quote or urllib.quote_plus
From the urllib documentation:
quote(string[, safe])
Replace special characters in string
using the "%xx" escape. Letters,
digits, and the characters "_.-" are
never quoted. The optional safe
parameter specifies additional
characters that should not be quoted
-- its default value is '/'.
Example: quote('/~connolly/') yields '/%7econnolly/'.
quote_plus(string[, safe])
Like quote(), but also replaces spaces
by plus signs, as required for quoting
HTML form values. Plus signs in the
original string are escaped unless
they are included in safe. It also
does not have safe default to '/'.
EDIT: Using urllib.quote or urllib.quote_plus on the whole URL will mangle it, as #ΤΖΩΤΖΙΟΥ points out:
>>> quoted_url = urllib.quote('http://www.example.com/foo goo/bar.html')
>>> quoted_url
'http%3A//www.example.com/foo%20goo/bar.html'
>>> urllib2.urlopen(quoted_url)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "c:\python25\lib\urllib2.py", line 124, in urlopen
return _opener.open(url, data)
File "c:\python25\lib\urllib2.py", line 373, in open
protocol = req.get_type()
File "c:\python25\lib\urllib2.py", line 244, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: http%3A//www.example.com/foo%20goo/bar.html
#ΤΖΩΤΖΙΟΥ provides a function that uses urlparse.urlparse and urlparse.urlunparse to parse the url and only encode the path. This may be more useful for you, although if you're building the URL from a known protocol and host but with a suspect path, you could probably do just as well to avoid urlparse and just quote the suspect part of the URL, concatenating with known safe parts.
Because this page is a top result for Google searches on the topic, I think it's worth mentioning some work that has been done on URL normalization with Python that goes beyond urlencoding space characters. For example, dealing with default ports, character case, lack of trailing slashes, etc.
When the Atom syndication format was being developed, there was some discussion on how to normalize URLs into canonical format; this is documented in the article PaceCanonicalIds on the Atom/Pie wiki. That article provides some good test cases.
I believe that one result of this discussion was Mark Nottingham's urlnorm.py library, which I've used with good results on a couple projects. That script doesn't work with the URL given in this question, however. So a better choice might be Sam Ruby's version of urlnorm.py, which handles that URL, and all of the aforementioned test cases from the Atom wiki.
Py3
from urllib.parse import urlparse, urlunparse, quote
def myquote(url):
parts = urlparse(url)
return urlunparse(parts._replace(path=quote(parts.path)))
>>> myquote('https://www.example.com/~user/with space/index.html?a=1&b=2')
'https://www.example.com/~user/with%20space/index.html?a=1&b=2'
Py2
import urlparse, urllib
def myquote(url):
parts = urlparse.urlparse(url)
return urlparse.urlunparse(parts[:2] + (urllib.quote(parts[2]),) + parts[3:])
>>> myquote('https://www.example.com/~user/with space/index.html?a=1&b=2')
'https://www.example.com/%7Euser/with%20space/index.html?a=1&b=2'
This quotes only the path component.
Just FYI, urlnorm has moved to github:
http://gist.github.com/246089
Valid for Python 3.5:
import urllib.parse
urllib.parse.quote([your_url], "\./_-:")
example:
import urllib.parse
print(urllib.parse.quote("http://www.example.com/foo goo/bar.html", "\./_-:"))
the output will be http://www.example.com/foo%20goo/bar.html
Font: https://docs.python.org/3.5/library/urllib.parse.html?highlight=quote#urllib.parse.quote
I encounter such an problem: need to quote the space only.
fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'#()*[]") do help, but it's too complicated.
So I used a simple way: url = url.replace(' ', '%20'), it's not perfect, but it's the simplest way and it works for this situation.
A lot of answers here talk about quoting URLs, not about normalizing them.
The best tool to normalize urls (for deduplication etc.) in Python IMO is w3lib's w3lib.url.canonicalize_url util.
Taken from the official docs:
Canonicalize the given url by applying the following procedures:
- sort query arguments, first by key, then by value
percent encode paths ; non-ASCII characters are percent-encoded using UTF-8 (RFC-3986)
- percent encode query arguments ; non-ASCII characters are percent-encoded using passed encoding (UTF-8 by default)
- normalize all spaces (in query arguments) ‘+’ (plus symbol)
- normalize percent encodings case (%2f -> %2F)
- remove query arguments with blank values (unless keep_blank_values is True)
- remove fragments (unless keep_fragments is True)
- List item
The url passed can be bytes or unicode, while the url returned is always a native str (bytes in Python 2, unicode in Python 3).
>>> import w3lib.url
>>>
>>> # sorting query arguments
>>> w3lib.url.canonicalize_url('http://www.example.com/do?c=3&b=5&b=2&a=50')
'http://www.example.com/do?a=50&b=2&b=5&c=3'
>>>
>>> # UTF-8 conversion + percent-encoding of non-ASCII characters
>>> w3lib.url.canonicalize_url('http://www.example.com/r\u00e9sum\u00e9')
'http://www.example.com/r%C3%A9sum%C3%A9'
I've used this util with great success when broad crawling the web to avoid duplicate requests because of minor url differences (different parameter order, anchors etc)