Unshorten url in python 3 - python

I am using this code to unshorten URLs in Python 3, but it returns the URL unchanged (still shortened). What should I do to get it unshortened?
import requests
import http.client
import urllib.parse as urlparse
def unshortenurl(url):
    parsed = urlparse.urlparse(url)
    h = http.client.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path)
    response = h.getresponse()
    if response.status/100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    else:
        return url

In Python 3, / is true division, so response.status/100 == 3 is True only for status code 300. For any other 3xx code it is False (301/100 is 3.01, for example). Use floor division instead, response.status//100 == 3, or some other way to test for redirection codes.
EDIT: It looks like you are using the code from the SO question posted by @Aybars, and there is a comment at the top of that snippet explaining what to do in Python 3. Also, it would have been nice to mention the source of the code.
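The difference between the two operators can be checked without any network access. The following is a minimal sketch rather than the original snippet: is_redirect is a hypothetical helper name, the 10-second timeout is an assumed value, and unshorten_url follows only a single hop over plain HTTP, as the question's code does.

```python
import http.client
from urllib.parse import urlparse


def is_redirect(status):
    """Return True for any 3xx status code (hypothetical helper)."""
    # Floor division maps 300..399 to 3; true division gives 3.01 for 301.
    return status // 100 == 3


def unshorten_url(url):
    """Follow a single redirect hop with a HEAD request (plain HTTP only)."""
    parsed = urlparse(url)
    # timeout=10 is an assumed value, not part of the question's code.
    conn = http.client.HTTPConnection(parsed.netloc, timeout=10)
    try:
        conn.request('HEAD', parsed.path or '/')
        response = conn.getresponse()
        location = response.getheader('Location')
        if is_redirect(response.status) and location:
            return location
        return url
    finally:
        conn.close()
```

Calling is_redirect(301) returns True, whereas the expression 301 / 100 == 3 from the question evaluates to False.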

Related

Python Short Url expander

I have a problem with expanding short URLs, since not all of the ones I work with use the same kind of redirection.
The idea is to expand shortened URLs. Here is an example of short URL --> final URL. I need a function that takes the shortened URL and returns the expanded URL.
http://chollo.to/675za --> http://www.elcorteingles.es/limite-48-horas/equipaje/?sorting=priceAsc&aff_id=2118094&dclid=COvjy8Xrz9UCFeMi0wod4ZULuw
So far I have something semi-working; it fails on some of the above examples.
import requests
import httplib
import urlparse
def unshorten_url(url):
    try:
        parsed = urlparse.urlparse(url)
        h = httplib.HTTPConnection(parsed.netloc)
        h.request('HEAD', parsed.path)
        response = h.getresponse()
        if response.status / 100 == 3 and response.getheader('Location'):
            url = requests.get(response.getheader('Location')).url
            print url
            return url
        else:
            url = requests.get(url).url
            print url
            return url
    except Exception as e:
        print(e)
The expected redirect does not appear to be well-formed according to requests:
import requests
response = requests.get('http://chollo.to/675za')
for resp in response.history:
    print(resp.status_code, resp.url)
print(response.url)
print(response.is_redirect)
Output:
301 http://chollo.to/675za
http://web.epartner.es/click.asp?ref=754218&site=14010&type=text&tnb=39&diurl=https%3A%2F%2Fad.doubleclick.net%2Fddm%2Fclk%2F302111021%3B129203261%3By%3Fhttp%3A%2F%2Fwww.elcorteingles.es%2Flimite-48-horas%2Fequipaje%2F%3Fsorting%3DpriceAsc%26aff_id%3D2118094
False
This is likely intentional by epartner or doubleclick. For these types of nested urls you would need an extra step like:
from urllib.parse import unquote
# from urllib import unquote # python2
# if response.url.count('http') > 1:
url = 'http' + response.url.split('http')[-1]
unquote(url)
# http://www.elcorteingles.es/limite-48-horas/equipaje/?sorting=priceAsc&aff_id=2118094
Note: by doing this you might be avoiding intended ad revenues.
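A slightly more defensive variant of that extra step is sketched below. It decodes the whole URL first and then keeps everything from the last http:// or https:// marker onward. The function name innermost_url is hypothetical, and the approach assumes the target URL is the last nested URL in the string, which holds for the epartner example above but is not guaranteed for every tracker.

```python
import re
from urllib.parse import unquote


def innermost_url(url):
    """Return the last URL nested inside `url` (hypothetical helper)."""
    decoded = unquote(url)
    # Collect the start offset of every scheme marker in the decoded string.
    starts = [m.start() for m in re.finditer(r'https?://', decoded)]
    # With no marker at all, fall back to the decoded input.
    return decoded[starts[-1]:] if starts else decoded


tracked = ('http://web.epartner.es/click.asp?ref=754218&site=14010&type=text'
           '&tnb=39&diurl=https%3A%2F%2Fad.doubleclick.net%2Fddm%2Fclk%2F'
           '302111021%3B129203261%3By%3Fhttp%3A%2F%2Fwww.elcorteingles.es%2F'
           'limite-48-horas%2Fequipaje%2F%3Fsorting%3DpriceAsc%26aff_id%3D2118094')
print(innermost_url(tracked))
# http://www.elcorteingles.es/limite-48-horas/equipaje/?sorting=priceAsc&aff_id=2118094
```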

What exactly requests function does?

I'm trying to send a request to a webpage and read its response. I wrote code that compares the text returned by requests with the page fetched by urllib, and I can't get the same page text. Am I using requests correctly?
I think I misunderstand how requests works and what it does. Can someone help me, please?
import requests
import urllib
def search():
    pr = {'q':'pink'}
    r = requests.get('http://stackoverflow.com/search',params=pr)
    returntext = r.text
    urllibtest(returntext)
def urllibtest(returntext):
    connection = urllib.urlopen("http://stackoverflow.com/search?q=pink")
    output = connection.read()
    connection.close()
    if output == returntext:
        print("ITS THE SAME PAGE")
    else:
        print("ITS NOT THE SAME PAGE")
search()
First of all, there is no good reason to expect two separate Stack Overflow searches to return exactly the same response anyway.
There is one logical difference here too: requests automatically decodes the output for you, so r.text is a different type from what urllib returns:
>>> type(output)
str
>>> type(r.text)
unicode
You can use the content instead if you don't want it decoded, and use a more predictable source to see the same content returned - for example:
>>> r1 = urllib.urlopen('http://httpbin.org').read()
>>> r2 = requests.get('http://httpbin.org').content
>>> r1 == r2
True
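The same type mismatch exists in Python 3, where an undecoded body is bytes and a decoded one is str, and the two never compare equal. The sketch below illustrates this without touching the network. The helper name bodies_match and the sample string are made up for the illustration, and UTF-8 is an assumed encoding.

```python
def bodies_match(raw_bytes, text, encoding='utf-8'):
    """Decode the raw body, then compare it with the text (hypothetical helper)."""
    return raw_bytes.decode(encoding) == text


raw = 'caf\u00e9'.encode('utf-8')  # stands in for r.content or urlopen(...).read()
text = 'caf\u00e9'                 # stands in for r.text

print(raw == text)                 # bytes and str are never equal in Python 3
print(bodies_match(raw, text))     # equal once both sides are decoded text
```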

Python reading json from a url [duplicate]

I am trying to GET a URL using Python and the response is JSON. However, when I run
import urllib2
response = urllib2.urlopen('https://api.instagram.com/v1/tags/pizza/media/XXXXXX')
html=response.read()
print html
The html variable is of type str, but I am expecting JSON. Is there any way I can capture the response as JSON or a Python dictionary instead of a str?
If the URL is returning valid JSON-encoded data, use the json library to decode that:
import urllib2
import json
response = urllib2.urlopen('https://api.instagram.com/v1/tags/pizza/media/XXXXXX')
data = json.load(response)
print data
import json
import urllib.request  # "import urllib" alone does not load the request submodule
url = 'http://example.com/file.json'
r = urllib.request.urlopen(url)
data = json.loads(r.read().decode(r.info().get_param('charset') or 'utf-8'))
print(data)
See the documentation for urllib (Python 3.4) and for HTTPMessage, which is what r.info() returns.
"""
Return JSON to webpage
Adding to wonderful answer by #Sanal
For Django 3.4
Adding a working url that returns a json (Source: http://www.jsontest.com/#echo)
"""
import json
import urllib
url = 'http://echo.jsontest.com/insert-key-here/insert-value-here/key/value'
respons = urllib.request.urlopen(url)
data = json.loads(respons.read().decode(respons.info().get_param('charset') or 'utf-8'))
return HttpResponse(json.dumps(data), content_type="application/json")
Be careful about validation and so on, but the straightforward solution is this:
import json
the_dict = json.load(response)
resource_url = 'http://localhost:8080/service/'
response = json.loads(urllib2.urlopen(resource_url).read())
Python 3 standard library one-liner:
load(urlopen(url))
# imports (place these above the code before running it)
from json import load
from urllib.request import urlopen
url = 'https://jsonplaceholder.typicode.com/todos/1'
You can also get JSON by using requests, as below:
import requests
r = requests.get('http://yoursite.com/your-json-pfile.json')
json_response = r.json()
Though I guess this has already been answered, I would like to add my little bit:
import json
import urllib2
class Website(object):
    def __init__(self,name):
        self.name = name
    def dump(self):
        self.data= urllib2.urlopen(self.name)
        return self.data
    def convJSON(self):
        data= json.load(self.dump())
        print data
domain = Website("https://example.com")
domain.convJSON()
Note: the object passed to json.load() must support .read(), so passing the string returned by urllib2.urlopen(self.name).read() would not work (json.loads() is the function that accepts a string).
The domain passed in must include the protocol, e.g. http://.
This is another, simpler solution to your question:
import pandas as pd
pd.read_json(json_data)
where json_data is the str output from the following code:
from urllib.request import urlopen
response = urlopen("https://data.nasa.gov/resource/y77d-th95.json")
json_data = response.read().decode('utf-8', 'replace')
None of the examples provided here worked for me. They were either for Python 2 (urllib2), or the Python 3 ones returned the error "ImportError: No module named request". I googled the error message, and it apparently requires me to install the module, which is obviously unacceptable for such a simple task.
This code worked for me:
import json,urllib
data = urllib.urlopen("https://api.github.com/users?since=0").read()
d = json.loads(data)
print (d)
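For reference, urllib.request is part of the Python 3 standard library and does not exist under that name in Python 2, so an ImportError naming it usually suggests the script ran under a Python 2 interpreter. A minimal Python 3 sketch follows. The function names decode_json and fetch_json are hypothetical, and the UTF-8 fallback and 10-second timeout are assumptions rather than anything stated in the answers above.

```python
import json
from urllib.request import urlopen


def decode_json(body, charset=None):
    """Decode raw response bytes and parse them as JSON (hypothetical helper)."""
    # Fall back to UTF-8 when the server does not declare a charset.
    return json.loads(body.decode(charset or 'utf-8'))


def fetch_json(url, timeout=10):
    """Fetch `url` and return the parsed JSON document."""
    with urlopen(url, timeout=timeout) as response:
        # response.headers is an HTTPMessage; it reports the declared charset.
        charset = response.headers.get_content_charset()
        return decode_json(response.read(), charset)
```

A call such as fetch_json('https://api.github.com/users?since=0') would return ordinary Python lists and dictionaries, provided the endpoint responds with valid JSON.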

Unshorten Flic.kr URLs

I have a Python script that unshortens URLs based on the answer posted here. So far it has worked pretty well, e.g., with youtu.be, goo.gl, t.co, bit.ly, and tinyurl.com. But now I have noticed that it doesn't work for Flickr's own URL shortener, flic.kr.
For example, when I enter the URL
https://flic.kr/p/qf3mGd
into a browser, I get redirected correctly to
https://www.flickr.com/photos/106783633@N02/15911453212/
However, when I try to unshorten the same URL with the Python script, I get the following redirects
https://flic.kr/p/qf3mgd
http://www.flickr.com/photo.gne?short=qf3mgd
http://www.flickr.com/signin/?acf=%2Fphoto.gne%3Fshort%3Dqf3mgd
https://login.yahoo.com/config/login?.src=flickrsignin&.pc=8190&.scrumb=[...]
thus eventually ending up on the Yahoo login page. Unshort.me, by the way, can unshorten the URL correctly. What am I missing here?
Here is the full source code of my script. I stumbled upon some pathological cases with the original script:
import urlparse
import httplib
def unshorten_url(url, max_tries=10):
    return __unshorten_url(url, [], max_tries)
def __unshorten_url(url, check_urls, max_tries):
    if max_tries == 0:
        if len(check_urls) > 0:
            return check_urls[0]
        return url
    if url in check_urls:
        return url
    unshortended = ''
    try:
        parsed = urlparse.urlparse(url)
        h = httplib.HTTPConnection(parsed.netloc)
        h.request('HEAD', url)
    except:
        return None
    try:
        response = h.getresponse()
    except:
        return url
    if response.status/100 == 3 and response.getheader('Location'):
        unshortended = response.getheader('Location')
    else:
        return url
    #print max_tries, unshortended
    if unshortended != url:
        if 'http' not in unshortended:
            return url
        check_urls.append(url)
        return __unshorten_url(unshortended, check_urls, (max_tries-1))
    else:
        return unshortended
print unshorten_url('http://t.co/5skmePb7gp')
EDIT: Full working example with a t.co URL
I'm using Requests [0] rather than httplib, in the following way, and it works fine with URLs like https://flic.kr/p/qf3mGd:
>>> import requests
>>> requests.head("https://flic.kr/p/qf3mGd", allow_redirects=True, verify=False).url
u'https://www.flickr.com/photos/106783633@N02/15911453212/'
[0] http://docs.python-requests.org/en/latest/
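For anyone who wants to stay with the standard library, the recursive script from the question can be sketched in Python 3 as a loop. This is an illustration under stated assumptions, not a confirmed fix for the flic.kr case. The name head_location, the fetch parameter, and the 10-second timeout are hypothetical additions. Unlike the original, it opens an HTTPS connection for https URLs and resolves relative Location headers against the current URL.

```python
import http.client
from urllib.parse import urljoin, urlparse


def head_location(url, timeout=10):
    """Send one HEAD request; return the redirect target, or None if there is none."""
    parsed = urlparse(url)
    conn_cls = (http.client.HTTPSConnection if parsed.scheme == 'https'
                else http.client.HTTPConnection)
    conn = conn_cls(parsed.netloc, timeout=timeout)
    try:
        path = parsed.path or '/'
        if parsed.query:
            path += '?' + parsed.query
        conn.request('HEAD', path)
        response = conn.getresponse()
        location = response.getheader('Location')
        if response.status // 100 == 3 and location:
            # Location may be relative, so resolve it against the current URL.
            return urljoin(url, location)
        return None
    finally:
        conn.close()


def unshorten_url(url, max_tries=10, fetch=head_location):
    """Follow up to max_tries redirects, stopping early on a redirect loop."""
    seen = set()
    for _ in range(max_tries):
        if url in seen:
            break
        seen.add(url)
        target = fetch(url)
        if target is None:
            break
        url = target
    return url
```

The fetch parameter exists only so the redirect-following logic can be exercised offline, for example by passing the get method of a dictionary that maps each URL to its redirect target.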
