Build a URL using the Requests module in Python

Is it possible to build a URL using the Requests library for Python?
Building a query string is supported, but what about building the rest of the URL? Specifically, I'd be interested in appending URL-encoded strings to a base URL:
http://some-address.com/api/[term]/
term = 'This is a test'
http://some-address.com/api/This+is+a+test/
This is presumably possible using urllib, but it seems like it would fit better in Requests. Does this feature exist? If not, is there a good reason that it shouldn't?

requests is basically a convenient wrapper around urllib (and urllib2, urllib3, and other related libraries).
You can import urljoin() and quote_plus() from requests.compat, but this is essentially the same as using them directly from the urllib and urlparse modules:
>>> from requests.compat import urljoin, quote_plus
>>> url = "http://some-address.com/api/"
>>> term = 'This is a test'
>>> urljoin(url, quote_plus(term))
'http://some-address.com/api/This+is+a+test'
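Note the difference between the two quoting helpers: quote() escapes a space as %20, while quote_plus() uses + (matching the output the question asks for):
>>> from requests.compat import quote, quote_plus
>>> quote('This is a test')
'This%20is%20a%20test'
>>> quote_plus('This is a test')
'This+is+a+test'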

Related

Convert urllib2 python code to use urllib module

I have the following code, which runs using the urllib2 module, but I have a requirement to upgrade to Python 3.x, and this prevents the use of urllib2. I am aware it is split across urllib.request and urllib.error, but I am struggling to convert the following code to use the urllib module instead after reading through the docs and other relevant questions. Any help is greatly appreciated.
opener = urllib2.build_opener(urllib2.HTTPHandler)
request = urllib2.Request(url=event['ResponseURL'], data=data)
request.add_header('Content-Type', '')
request.get_method = lambda: 'PUT'
url = opener.open(request)
All you need to do is replace urllib2 with urllib.request. You are not using anything that has moved to other urllib.* modules:
import urllib.request
opener = urllib.request.build_opener(urllib.request.HTTPHandler)
request = urllib.request.Request(url=event['ResponseURL'], data=data)
request.add_header('Content-Type', '')
request.get_method = lambda: 'PUT'
url = opener.open(request)
You can always run the 2to3 command-line tool on your Python 2 code and see what changes it makes; the default action is to output changes on stdout in unified diff format.
The urllib fixer will then also add imports for urllib.error and urllib.parse at the top, because it knows that code that imported urllib2 could need any of the 3 urllib.* modules; it isn't smart enough to limit the import only to those that are actually needed after transforming the rest of the urllib2 references in the module.
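For illustration, a hypothetical run on a module named legacy.py (the file name is just a placeholder):
$ 2to3 legacy.py      # prints the proposed changes as a unified diff on stdout
$ 2to3 -w legacy.py   # rewrites the file in place (a .bak backup is kept by default)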

Redact and remove password from URL

I have an URL like this:
https://user:password@example.com/path?key=value#hash
The result should be:
https://user:???@example.com/path?key=value#hash
I could use a regex, but instead I would like to parse the URL into a high-level data structure, operate on that structure, and then serialize it back to a string.
Is this possible with Python?
You can use the built-in urlparse function to extract the password from a URL. It is available in both Python 2 and 3, but in different locations:
Python 2: import urlparse
Python 3: from urllib.parse import urlparse
Example
from urllib.parse import urlparse
parsed = urlparse("https://user:password@example.com/path?key=value#hash")
parsed.password # 'password'
replaced = parsed._replace(netloc="{}:{}@{}".format(parsed.username, "???", parsed.hostname))
replaced.geturl() # 'https://user:???@example.com/path?key=value#hash'
See also this question: Changing hostname in a url
from urllib.parse import urlparse

def redact_url(url: str) -> str:
    url_components = urlparse(url)
    if url_components.username or url_components.password:
        url_components = url_components._replace(
            netloc=f"{url_components.username}:???@{url_components.hostname}",
        )
    return url_components.geturl()
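A quick sanity check of the helper (note that rebuilding netloc from only the username and hostname drops any explicit port, so a URL like https://user:pw@example.com:8443/ would lose its :8443):
>>> redact_url("https://user:password@example.com/path?key=value#hash")
'https://user:???@example.com/path?key=value#hash'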
The pip package already has an internal utility function which does exactly this.
>>> from pip._internal.utils.misc import redact_auth_from_url
>>>
>>> redact_auth_from_url("https://user:password@example.com/path?key=value#hash")
'https://user:****@example.com/path?key=value#hash'
>>> redact_auth_from_url.__doc__
'Replace the password in a given url with ****.'
This will provide the expected result even if the URL does not contain a username or password.
>>> redact_auth_from_url("https://example.com/path?key=value#hash")
'https://example.com/path?key=value#hash'
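Keep in mind that pip._internal is, as the name suggests, not a public API; the function may move or change between pip releases, so don't rely on it in production code.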

Download file as string in Python

I want to download a file into Python as a string. I have tried the following, but it doesn't seem to work. What am I doing wrong, or what else might I do?
from urllib import request
webFile = request.urlopen(url).read()
print(webFile)
The following example works.
from urllib.request import urlopen
url = 'http://winterolympicsmedals.com/medals.csv'
output = urlopen(url).read()
print(output.decode('utf-8'))
Alternatively, you could use requests, which provides a more human-readable syntax. Keep in mind that requests requires you to install additional dependencies, which may increase the complexity of deploying the application, depending on your production environment.
import requests
url = 'http://winterolympicsmedals.com/medals.csv'
output = requests.get(url).text
print(output)
In Python 3.x, you can use the built-in urllib package like this:
from urllib.request import urlopen
data = urlopen('http://www.google.com').read() #bytes
body = data.decode('utf-8')
Another good library for this is requests (http://docs.python-requests.org). It's not built-in, but I've found it to be much more usable than urllib*.

What can be used instead of parse_qs function

I have the following code for parsing a YouTube feed and returning the YouTube movie ID. How can I rewrite this to be Python 2.4 compatible? I suppose 2.4 doesn't support the parse_qs function.
YTSearchFeed = feedparser.parse("http://gdata.youtube.com" + path)
videos = []
for yt in YTSearchFeed.entries:
    url_data = urlparse.urlparse(yt['link'])
    query = urlparse.parse_qs(url_data[4])
    id = query["v"][0]
    videos.append(id)
I assume your existing code runs in 2.6 or something newer, and you're trying to go back to 2.4? parse_qs used to be in the cgi module before it was moved to urlparse. Try import cgi, cgi.parse_qs.
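A minimal sketch of the relevant line from the question under Python 2.4, assuming the same url_data variable (only the module providing parse_qs changes):
import cgi
query = cgi.parse_qs(url_data[4])  # same call as urlparse.parse_qs, sourced from cgi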
Inspired by TryPyPy's comment, I think you could make your source run in either environment by doing:
import urlparse  # if we're pre-2.6, this will not include parse_qs
try:
    from urlparse import parse_qs
except ImportError:  # old version, grab it from cgi
    from cgi import parse_qs
urlparse.parse_qs = parse_qs
But I don't have 2.4 to try this out, so no promises.
I tried that, and still it wasn't working.
It's easier to simply copy the parse_qs/qsl functions over from the cgi module to the urlparse module.
Problem solved.

How to fetch a non-ascii url with urlopen?

I need to fetch data from a URL with non-ASCII characters, but urllib2.urlopen refuses to open the resource and raises:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0131' in position 26: ordinal not in range(128)
I know the URL is not standards compliant but I have no chance to change it.
What is the way to access a resource pointed to by a URL containing non-ASCII characters using Python?
Edit: In other words, can urlopen open a URL like the following, and if so, how?
http://example.org/Ñöñ-ÅŞÇİİ/
Strictly speaking URIs can't contain non-ASCII characters; what you have there is an IRI.
To convert an IRI to a plain ASCII URI:
non-ASCII characters in the hostname part of the address have to be encoded using the Punycode-based IDNA algorithm;
non-ASCII characters in the path, and most of the other parts of the address have to be encoded using UTF-8 and %-encoding, as per Ignacio's answer.
So:
import re, urlparse

def urlEncodeNonAscii(b):
    return re.sub('[\x80-\xFF]', lambda c: '%%%02x' % ord(c.group(0)), b)

def iriToUri(iri):
    parts = urlparse.urlparse(iri)
    return urlparse.urlunparse(
        part.encode('idna') if parti == 1 else urlEncodeNonAscii(part.encode('utf-8'))
        for parti, part in enumerate(parts)
    )
>>> iriToUri(u'http://www.a\u0131b.com/a\u0131b')
'http://www.xn--ab-hpa.com/a%c4%b1b'
(Technically this still isn't quite good enough in the general case because urlparse doesn't split away any user:pass# prefix or :port suffix on the hostname. Only the hostname part should be IDNA encoded. It's easier to encode using normal urllib.quote and .encode('idna') at the time you're constructing a URL than to have to pull an IRI apart.)
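As a rough Python 2 sketch of that suggestion, reusing the thread's own example values (this is not a general-purpose implementation):
import urllib
host = u'www.a\u0131b.com'.encode('idna')          # IDNA-encode only the hostname
path = urllib.quote(u'a\u0131b'.encode('utf-8'))   # UTF-8, then percent-encode the path
url = 'http://%s/%s' % (host, path)
# 'http://www.xn--ab-hpa.com/a%C4%B1b' (quote() upper-cases the hex digits)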
In python3, use the urllib.parse.quote function on the non-ascii string:
>>> from urllib.request import urlopen
>>> from urllib.parse import quote
>>> chinese_wikipedia = 'http://zh.wikipedia.org/wiki/Wikipedia:' + quote('首页')
>>> urlopen(chinese_wikipedia)
Python 3 has libraries to handle this situation. Use urllib.parse.urlsplit to split the URL into its components, urllib.parse.quote to properly quote/escape the unicode characters, and urllib.parse.urlunsplit to join it back together.
>>> import urllib.parse
>>> url = 'http://example.com/unicodè'
>>> url = urllib.parse.urlsplit(url)
>>> url = list(url)
>>> url[2] = urllib.parse.quote(url[2])
>>> url = urllib.parse.urlunsplit(url)
>>> print(url)
http://example.com/unicod%C3%A8
It is more complex than the accepted answer by @bobince suggests:
netloc should be encoded using IDNA;
a non-ASCII URL path should be encoded to UTF-8 and then percent-escaped;
non-ASCII query parameters should be encoded to the encoding of the page the URL was extracted from (or to the encoding the server uses), then percent-escaped.
This is how all browsers work; it is specified in https://url.spec.whatwg.org/ - see this example. A Python implementation can be found in w3lib (this is the library Scrapy is using); see w3lib.url.safe_url_string:
from w3lib.url import safe_url_string
url = safe_url_string(u'http://example.org/Ñöñ-ÅŞÇİİ/', encoding="<page encoding>")
An easy way to check if a URL escaping implementation is incorrect/incomplete is to check if it provides 'page encoding' argument or not.
Based on @darkfeline's answer:
from urllib.parse import urlsplit, urlunsplit, quote

def iri2uri(iri):
    """
    Convert an IRI to a URI (Python 3).
    """
    uri = ''
    if isinstance(iri, str):
        (scheme, netloc, path, query, fragment) = urlsplit(iri)
        scheme = quote(scheme)
        netloc = netloc.encode('idna').decode('utf-8')
        path = quote(path)
        query = quote(query)
        fragment = quote(fragment)
        uri = urlunsplit((scheme, netloc, path, query, fragment))
    return uri
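A quick check against the accepted answer's example (quote() upper-cases the hex digits, unlike the handwritten encoder above):
>>> iri2uri('http://www.a\u0131b.com/a\u0131b')
'http://www.xn--ab-hpa.com/a%C4%B1b'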
For those not depending strictly on urllib, one practical alternative is requests, which handles IRIs "out of the box".
For example, with http://bücher.ch:
>>> import requests
>>> r = requests.get(u'http://b\u00DCcher.ch')
>>> r.status_code
200
Encode the unicode to UTF-8, then URL-encode.
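For example, a minimal Python 2 sketch of that recipe (the host and path here are hypothetical):
import urllib
iri_path = u'/N\xf6n-ascii/'
url = 'http://example.org' + urllib.quote(iri_path.encode('utf-8'))
# url is now the pure-ASCII 'http://example.org/N%C3%B6n-ascii/'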
Use the iri2uri method of httplib2. It does the same thing as bobince's answer (is he/she the author of that?).
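A sketch of that usage, assuming the function lives in httplib2's iri2uri module (as it does in recent releases):
>>> from httplib2.iri2uri import iri2uri
>>> iri2uri(u'http://\u2603.com/')
'http://xn--n3h.com/'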
Another option to convert an IRI to an ASCII URI is to use the furl package:
gruns/furl: 🌐 URL parsing and manipulation made easy. - https://github.com/gruns/furl
Python's standard urllib and urlparse modules provide a number of URL
related functions, but using these functions to perform common URL
operations proves tedious. Furl makes parsing and manipulating URLs
easy.
Examples
Non-ASCII domain
http://国立極地研究所.jp/english/ (Japanese National Institute of Polar Research website)
import furl
url = 'http://国立極地研究所.jp/english/'
furl.furl(url).tostr()
'http://xn--vcsoey76a2hh0vtuid5qa.jp/english/'
Non-ASCII path
https://ja.wikipedia.org/wiki/日本語 ("Japanese" article in Wikipedia)
import furl
url = 'https://ja.wikipedia.org/wiki/日本語'
furl.furl(url).tostr()
'https://ja.wikipedia.org/wiki/%E6%97%A5%E6%9C%AC%E8%AA%9E'
Works, finally! I could not avoid these strange characters, but in the end I got through it.
import urllib.request
import os

url = "http://www.fourtourismblog.it/le-nuove-tendenze-del-marketing-tenere-docchio/"
with urllib.request.urlopen(url) as file:
    html = file.read()
with open("marketingturismo.html", "w", encoding='utf-8') as file:
    file.write(html.decode('utf-8'))
os.system("marketingturismo.html")
