I want to be able to take a shortened or non-shortened URL and return its un-shortened form. How can I make a python program to do this?
Additional Clarification:
Case 1: shortened --> unshortened
Case 2: unshortened --> unshortened
e.g. bit.ly/silly in the input array should be google.com in the output array
e.g. google.com in the input array should be google.com in the output array
Send an HTTP HEAD request to the URL and look at the response code. If the code is 30x, look at the Location header to get the unshortened URL. Otherwise, if the code is 20x, then the URL is not redirected; you probably also want to handle error codes (4xx and 5xx) in some fashion. For example:
# This is for Py2k. For Py3k, use http.client and urllib.parse instead, and
# use // instead of / for the division
import httplib
import urlparse

def unshorten_url(url):
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path)
    response = h.getresponse()
    if response.status / 100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    else:
        return url
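Following the comment above, a rough Python 3 translation of the same function might look like this (a sketch; the HTTPS branch is an addition, since the original only handles plain HTTP):

import http.client
import urllib.parse

def unshorten_url(url):
    parsed = urllib.parse.urlparse(url)
    if parsed.scheme == 'https':
        h = http.client.HTTPSConnection(parsed.netloc)
    else:
        h = http.client.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path or '/')
    response = h.getresponse()
    if response.status // 100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    return url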
Using requests:
import requests

session = requests.Session()  # so connections are recycled
url = 'http://bit.ly/silly'   # the shortened URL to resolve
resp = session.head(url, allow_redirects=True)
print(resp.url)
Unshorten.me has an API that lets you send a JSON or XML request and get the full URL returned.
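For example, a minimal sketch of calling that service with requests; the /json/<url> endpoint shape and the resolved_url response field are assumptions based on its public docs, so verify them against the current API before relying on this:

import requests

def unshorten_via_api(short_url):
    # Assumed endpoint shape: https://unshorten.me/json/<url-to-resolve>
    resp = requests.get('https://unshorten.me/json/' + short_url, timeout=10)
    resp.raise_for_status()
    return resp.json().get('resolved_url', short_url)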
If you are using Python 3.5+ you can use the Unshortenit module that makes this very easy:
from unshortenit import UnshortenIt
unshortener = UnshortenIt()
uri = unshortener.unshorten('https://href.li/?https://example.com')
Open the url and see what it resolves to:
>>> import urllib2
>>> a = urllib2.urlopen('http://bit.ly/cXEInp')
>>> print a.url
http://www.flickr.com/photos/26432908@N00/346615997/sizes/l/
>>> a = urllib2.urlopen('http://google.com')
>>> print a.url
http://www.google.com/
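The same check in Python 3, where urllib2 became urllib.request:

>>> from urllib.request import urlopen
>>> a = urlopen('http://bit.ly/cXEInp')
>>> print(a.url)
http://www.flickr.com/photos/26432908@N00/346615997/sizes/l/
>>> a = urlopen('http://google.com')
>>> print(a.url)
http://www.google.com/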
To unshorten, you can use requests. This is a simple solution that works for me.
import requests
url = "http://foo.com"
site = requests.get(url)
print(site.url)
http://github.com/stef/urlclean
sudo pip install urlclean

import urlclean
urlclean.unshorten(url)
Here is some source code that takes into account almost all of the useful corner cases:
sets a custom timeout
sets a custom User-Agent
checks whether we have to use an HTTP or an HTTPS connection
resolves the input URL recursively and prevents ending up in a redirect loop
The source code is on GitHub at https://github.com/amirkrifa/UnShortenUrl
Comments are welcome ...
import logging
logging.basicConfig(level=logging.DEBUG)

TIMEOUT = 10

class UnShortenUrl:
    def process(self, url, previous_url=None):
        logging.info('Init url: %s' % url)
        import urlparse
        import httplib
        try:
            parsed = urlparse.urlparse(url)
            if parsed.scheme == 'https':
                h = httplib.HTTPSConnection(parsed.netloc, timeout=TIMEOUT)
            else:
                h = httplib.HTTPConnection(parsed.netloc, timeout=TIMEOUT)
            resource = parsed.path
            if parsed.query != "":
                resource += "?" + parsed.query
            try:
                h.request('HEAD',
                          resource,
                          headers={'User-Agent': 'curl/7.38.0'})
                response = h.getresponse()
            except:
                import traceback
                traceback.print_exc()
                return url
            logging.info('Response status: %d' % response.status)
            if response.status / 100 == 3 and response.getheader('Location'):
                red_url = response.getheader('Location')
                logging.info('Redirect, previous: %s, %s' % (red_url, previous_url))
                if red_url == previous_url:
                    return red_url
                return self.process(red_url, previous_url=url)
            else:
                return url
        except:
            import traceback
            traceback.print_exc()
            return None
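A hypothetical call, reusing the bit.ly link from the urllib2 example above:

print(UnShortenUrl().process('http://bit.ly/cXEInp'))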
You can use geturl():
from urllib.request import urlopen

url = "http://bit.ly/silly"  # note: urlopen requires the scheme
unshortened_url = urlopen(url).geturl()
print(unshortened_url)
# google.com
This is a very easy task: you just need four lines of code, that's it :)
import requests

url = input('Enter url : ')
site = requests.get(url)
print(site.url)
Just run this code and it will successfully unshorten the URL.
Related
I'm trying to GET an URL of the following format using requests.get() in python:
http://api.example.com/export/?format=json&key=site:dummy+type:example+group:wheel
#!/usr/local/bin/python
import requests
print(requests.__version__)
url = 'http://api.example.com/export/'
payload = {'format': 'json', 'key': 'site:dummy+type:example+group:wheel'}
r = requests.get(url, params=payload)
print(r.url)
However, the URL gets percent encoded and I don't get the expected response.
2.2.1
http://api.example.com/export/?key=site%3Adummy%2Btype%3Aexample%2Bgroup%3Awheel&format=json
This works if I pass the URL directly:
url = 'http://api.example.com/export/?format=json&key=site:dummy+type:example+group:wheel'
r = requests.get(url)
Is there some way to pass the parameters in their original form, without percent encoding?
Thanks!
It is not a good solution, but you can pass the string directly:
r = requests.get(url, params='format=json&key=site:dummy+type:example+group:wheel')
BTW: here is code which converts the payload to this string:
payload = {
    'format': 'json',
    'key': 'site:dummy+type:example+group:wheel'
}
payload_str = "&".join("%s=%s" % (k, v) for k, v in payload.items())
# 'format=json&key=site:dummy+type:example+group:wheel'
r = requests.get(url, params=payload_str)
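Note that building payload_str by hand like this skips URL encoding entirely, so it only behaves as intended when every key and value is already URL-safe.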
EDIT (2020):
You can also use urllib.parse.urlencode(...) with the parameter safe=':+' to create the string without converting the characters : and +.
As far as I know, requests also uses urllib.parse.urlencode(...) for this, but without safe=.
import requests
import urllib.parse
payload = {
    'format': 'json',
    'key': 'site:dummy+type:example+group:wheel'
}
payload_str = urllib.parse.urlencode(payload, safe=':+')
# 'format=json&key=site:dummy+type:example+group:wheel'
url = 'https://httpbin.org/get'
r = requests.get(url, params=payload_str)
print(r.text)
I used page https://httpbin.org/get to test it.
In case someone else comes across this in the future: you can subclass requests.Session, override the send method, and alter the raw URL to fix percent encodings and the like.
Corrections to the below are welcome.
import requests
import urllib.parse

class NoQuotedCommasSession(requests.Session):
    def send(self, *a, **kw):
        # a[0] is the prepared request
        a[0].url = a[0].url.replace(urllib.parse.quote(","), ",")
        return requests.Session.send(self, *a, **kw)

s = NoQuotedCommasSession()
s.get("http://somesite.com/an,url,with,commas,that,won't,be,encoded.")
The solution, as designed, is to pass the URL directly.
The answers above didn't work for me.
I was trying to do a GET request where a parameter contained a pipe, but python requests would percent-encode the pipe too. So instead I used urlopen:
# python3
from urllib.request import urlopen
base_url = 'http://www.example.com/search?'
query = 'date_range=2017-01-01|2017-03-01'
url = base_url + query
response = urlopen(url)
data = response.read()
# response data valid
print(response.url)
# output: 'http://www.example.com/search?date_range=2017-01-01|2017-03-01'
None of the above solutions seems to work anymore as of requests version 2.26. The solution suggested in the GitHub repo appears to be a workaround using a PreparedRequest.
The following worked for me. Make sure the URL is resolvable, so don't use 'this-is-not-a-domain.com'.
import requests
base_url = 'https://www.example.com/search'
query = '?format=json&key=site:dummy+type:example+group:wheel'
s = requests.Session()
req = requests.Request('GET', base_url)
p = req.prepare()
p.url += query
resp = s.send(p)
print(resp.request.url)
Source: https://github.com/psf/requests/issues/5964#issuecomment-949013046
Please have a look at the 1st option in this GitHub link. You can ignore the urllib part, which means using prep.url = url instead of prep.url = url + qry.
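A minimal sketch of that first option (variable names are illustrative): build the fully encoded URL yourself and assign it to the prepared request unchanged, instead of appending a query string:

import requests

full_url = 'https://www.example.com/search?format=json&key=site:dummy+type:example+group:wheel'
s = requests.Session()
prep = requests.Request('GET', full_url).prepare()
prep.url = full_url  # overwrite whatever encoding prepare() applied
resp = s.send(prep)
print(resp.request.url)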
Can someone tell me how to check the status code of an HTTP response with http.client? I didn't find anything specific about that in the http.client documentation.
Code would look like this:
if conn.getresponse():
    return True   # status code == 200
else:
    return False  # status code != 200
My code looks like this:
from urllib.parse import urlparse
import http.client, sys

def check_url(url):
    url = urlparse(url)
    conn = http.client.HTTPConnection(url.netloc)
    conn.request("HEAD", url.path)
    r = conn.getresponse()
    if r.status == 200:
        return True
    else:
        return False

if __name__ == "__main__":
    input_url = input("Enter the website to be checked (beginning with www):")
    url = "http://" + input_url
    url_https = "https://" + input_url
    if check_url(url_https):
        print("The entered Website supports HTTPS.")
    else:
        if check_url(url):
            print("The entered Website doesn't support HTTPS, but supports HTTP.")
    if check_url(url):
        print("The entered Website supports HTTP too.")
Take a look at the documentation here; you simply need to do:
r = conn.getresponse()
print(r.status, r.reason)
Update: if you want (as mentioned in the comments) to check an HTTP connection, you could use an HTTPConnection and read the status:
import http.client
conn = http.client.HTTPConnection("docs.python.org")
conn.request("GET", "/")
r1 = conn.getresponse()
print(r1.status, r1.reason)
If the website is correctly configured to implement HTTPS, you should not get a status code of 200; in this example you'll get a 301 Moved Permanently response, which means the request was redirected, in this case rewritten to HTTPS.
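A small sketch extending the example above: on a 301, the Location header shows where the redirect points, which for a site like this is typically the HTTPS version of the same page.

import http.client

conn = http.client.HTTPConnection("docs.python.org")
conn.request("GET", "/")
r = conn.getresponse()
print(r.status, r.reason)       # e.g. 301 Moved Permanently
print(r.getheader("Location"))  # e.g. https://docs.python.org/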
I want to verify whether a URL is a link to a raw video file, for example:
http://hidden_path/video_name.mp4
Below is my current code:
import re
import urllib.request

def is_video(url):
    r = None
    try:
        r = urllib.request.urlopen(urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'}))
    except:
        return False
    content_type = r.getheader("Content-Type")
    if re.match("video*", content_type):
        return True
    return False
This code will run into trouble if the URL points to a big video, and it may cause a timeout error on the server.
Are there any better approaches?
If you just want to check the Content-Type header, you can send a HEAD request instead of a GET.
Once you have the response from the HEAD request, you can check for video in the Content-Type header as above.
Example:
>>> import urllib.request
>>> req = urllib.request.Request(url, method='HEAD', headers={'User-Agent': 'Mozilla/5.0'})
>>> r = urllib.request.urlopen(req)
>>> r.getheader('Content-Type')
'video/mp4'
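If you are already using requests elsewhere, the equivalent HEAD check is a short sketch like the following (the timeout and User-Agent values are arbitrary):

import requests

url = 'http://hidden_path/video_name.mp4'  # the example URL from the question
resp = requests.head(url, headers={'User-Agent': 'Mozilla/5.0'},
                     allow_redirects=True, timeout=10)
print(resp.headers.get('Content-Type', ''))  # e.g. 'video/mp4'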
Hope this does it:
import mimetypes

url = 'http://media.theaterchurch.com/podcast/video/hd/720p/2016/05-08-16-720p.mp4'
print(mimetypes.MimeTypes().guess_type(url)[0])
outputs this...
video/mp4
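Keep in mind that mimetypes only guesses from the file extension and never touches the network, so it is fast but will miss videos served from extension-less URLs, and it cannot confirm what the server actually returns.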
How do I get a cookie from an urllib.request?
import urllib.request
import urllib.parse

data = urllib.parse.urlencode({
    'user': 'user',
    'pass': 'pass'
})
data = data.encode('utf-8')
request = urllib.request.urlopen('http://example.com', data)
print(request.info())
request.info() returns the cookies, but not in a very usable way.
response.info() is a dict-like object, so you can parse any info you need. Here is a demo written in Python 3 (wrapped in a function so the early return on error is valid):
from urllib import request
from urllib.error import HTTPError

def fetch_cookies(url, header_params):
    req = request.Request(url, data=None, headers=header_params, method='GET')
    try:
        response = request.urlopen(req)
        cookie = response.info().get_all('Set-Cookie')
        content_type = response.info()['Content-Type']
    except HTTPError as err:
        print("err status: {0}".format(err))
        return None
    return cookie
You can now parse the cookie variable as your application requires.
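For instance, a small sketch of that parsing step using the standard library's http.cookies.SimpleCookie (the sample header value is made up):

from http.cookies import SimpleCookie

set_cookie_headers = ['sessionid=abc123; Path=/; HttpOnly']  # e.g. from get_all('Set-Cookie')
for header in set_cookie_headers:
    jar = SimpleCookie()
    jar.load(header)
    for name, morsel in jar.items():
        print(name, morsel.value)  # sessionid abc123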
Just used the following code to get a cookie from Python Challenge #17; hope it helps (Python 3.8 being used):
import http.cookiejar
import urllib.request

cookiejar = http.cookiejar.CookieJar()
cookieproc = urllib.request.HTTPCookieProcessor(cookiejar)
opener = urllib.request.build_opener(cookieproc)
response = opener.open(url)  # url: the challenge page to fetch
for cookie in cookiejar:
    print(cookie.name, cookie.value)
I think using the requests package is a much better choice these days. Try this sample code that shows Google setting cookies when you visit:
import requests

url = "http://www.google.com"
r = requests.get(url, timeout=5)
if r.status_code == 200:
    for cookie in r.cookies:
        print(cookie)  # Use "print cookie" if you use Python 2.
Gives:
Cookie NID=67=n0l3ME1Jl3-wwlH7oE5pvxJ_CfU12hT5Kh65wh21bvE3hrKFAo1sJVj_UcuLCr76Ubi3yxENROaYNEitdgW4IttL43YZGlf8xAPl1IbzoLG31KP5U2tiP2y4DzVOJ2fA for .google.se/
Cookie PREF=ID=ce66d1288fc0d977:FF=0:TM=1407525509:LM=1407525509:S=LxQv7q8fju-iHJPZ for .google.se/