I have a Python script that unshortens URLs, based on the answer posted here. So far it has worked pretty well, e.g., with youtu.be, goo.gl, t.co, bit.ly, and tinyurl.com. But now I've noticed that it doesn't work for Flickr's own URL shortener, flic.kr.
For example, when I enter the URL
https://flic.kr/p/qf3mGd
into a browser, I get redirected correctly to
https://www.flickr.com/photos/106783633@N02/15911453212/
However, when I use the Python script to unshorten the same URL, I get the following redirects:
https://flic.kr/p/qf3mgd
http://www.flickr.com/photo.gne?short=qf3mgd
http://www.flickr.com/signin/?acf=%2Fphoto.gne%3Fshort%3Dqf3mgd
https://login.yahoo.com/config/login?.src=flickrsignin&.pc=8190&.scrumb=[...]
thus eventually ending up on the Yahoo login page. Unshort.me, by the way, can unshorten the URL correctly. What am I missing here?
Here is the full source code of my script. I stumbled upon some pathological cases with the original script:
import urlparse
import httplib

def unshorten_url(url, max_tries=10):
    return __unshorten_url(url, [], max_tries)

def __unshorten_url(url, check_urls, max_tries):
    # Give up after max_tries redirects and return the first URL we saw
    if max_tries == 0:
        if len(check_urls) > 0:
            return check_urls[0]
        return url
    # Redirect loop detected
    if url in check_urls:
        return url
    unshortened = ''
    try:
        parsed = urlparse.urlparse(url)
        h = httplib.HTTPConnection(parsed.netloc)
        h.request('HEAD', url)
    except:
        return None
    try:
        response = h.getresponse()
    except:
        return url
    if response.status/100 == 3 and response.getheader('Location'):
        unshortened = response.getheader('Location')
    else:
        return url
    #print max_tries, unshortened
    if unshortened != url:
        # A relative redirect target: stop here
        if 'http' not in unshortened:
            return url
        check_urls.append(url)
        return __unshorten_url(unshortened, check_urls, (max_tries-1))
    else:
        return unshortened

print unshorten_url('http://t.co/5skmePb7gp')
EDIT: Full working example with a t.co URL
I'm using requests [0] rather than httplib in the following way, and it works fine with URLs like https://flic.kr/p/qf3mGd:
>>> import requests
>>> requests.head("https://flic.kr/p/qf3mGd", allow_redirects=True, verify=False).url
u'https://www.flickr.com/photos/106783633@N02/15911453212/'
[0] http://docs.python-requests.org/en/latest/
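If you want this as a drop-in replacement for the unshorten_url function above, a minimal sketch could look like this (the timeout and error handling are my own additions, not part of the original answer):

import requests

def unshorten_url(url, timeout=10):
    # Let requests follow the whole redirect chain and report the final URL
    try:
        return requests.head(url, allow_redirects=True,
                             verify=False, timeout=timeout).url
    except requests.RequestException:
        return url  # on any network error, fall back to the input URL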
Related
I was trying to create an Instagram post downloader bot with Python:
import requests
import re

# Fetch the page, naively retrying until we get a 200 response
def get_response(url):
    r = requests.get(url)
    while r.status_code != 200:
        r = requests.get(url)
    return r.text

# De-duplicate the matches and unescape the URLs
def prepare_urls(matches):
    return list({match.replace("\\u0026", "&") for match in matches})

url = input('Enter Instagram URL: ')
response = get_response(url)

# Check whether the returned page contains video or picture URLs
vid_matches = re.findall('"video_url":"([^"]+)"', response)
pic_matches = re.findall('"display_url":"([^"]+)"', response)
vid_urls = prepare_urls(vid_matches)
pic_urls = prepare_urls(pic_matches)

if vid_urls:
    print('Detected Videos:\n{0}'.format('\n'.join(vid_urls)))
if pic_urls:
    print('Detected Pictures:\n{0}'.format('\n'.join(pic_urls)))
if not (vid_urls or pic_urls):
    print('Could not recognize the media in the provided URL.')
After I finished the code, I tried it with a video link and it worked. An hour later I tried the same video link, but it printed the third condition: "Could not recognize the media in the provided URL.".
I'm confused. As you can see, I never used my login credentials in the code, yet the first time it works and the second time it doesn't.
Any ideas?
Make it so that each URL ends with the string "?__a=1". (When I have some free time, I'll edit this post and add the exact command to append the string to the URL's end; a rough sketch follows the example below.)
For example, instead of:
https://www.instagram.com/p/CECsuu2BgXj/
It should be:
https://www.instagram.com/p/CECsuu2BgXj/?__a=1
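In the meantime, a one-liner along these lines should do it (a sketch; the exact command was left as a to-do above):

url = input('Enter Instagram URL: ').rstrip('/')
url = url + '/?__a=1'  # ask Instagram for the JSON representation of the post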
Output:
Detected Videos:
https://instagram.fdet1-2.fna.fbcdn.net/v/t50.2886-16/117817389_1889475617843249_1329686959743847420_n.mp4?efg=eyJ2ZW5jb2RlX3RhZyI6InZ0c192b2RfdXJsZ2VuLjcyMC5jbGlwcy5kZWZhdWx0IiwicWVfZ3JvdXBzIjoiW1wiaWdfd2ViX2RlbGl2ZXJ5X3Z0c19vdGZcIl0ifQ&_nc_ht=instagram.fdet1-2.fna.fbcdn.net&_nc_cat=105&_nc_ohc=OZRYx-3yUoAAX-b1xzZ&edm=AABBvjUBAAAA&vs=17858436890092651_3299599943&_nc_vs=HBksFQAYJEdDM0FCUWN4YUFQVGQ3WUdBUHhMQUxJXy0zTVNicV9FQUFBRhUAAsgBABUAGCRHQ0hOQ2dkbFlrcEYwOWtDQUtHQ0RqWUV4cGdzYnFfRUFBQUYVAgLIAQAoABgAGwAVAAAm1onK7OqJuT8VAigCQzMsF0AkmZmZmZmaGBJkYXNoX2Jhc2VsaW5lXzFfdjERAHX%2BBwA%3D&ccb=7-4&oe=6200F187&oh=00_AT-WTSxaoeTOd_GO0gMtqSqkgRXtxibffFG5pJGyCOPTNQ&_nc_sid=83d603
Detected Pictures:
https://instagram.fdet1-1.fna.fbcdn.net/v/t51.2885-15/e35/117915347_192544875567579_944852773653606759_n.jpg?_nc_ht=instagram.fdet1-1.fna.fbcdn.net&_nc_cat=103&_nc_ohc=0Bdvog7HWe8AX-3vsql&edm=AABBvjUBAAAA&ccb=7-4&oh=00_AT_O33BzV3tCKaDp_9eqeBUiYgyzVguImltLTuPIPKP4hg&oe=6201035F&_nc_sid=83d603
For more info, check out this awesome post.
I am using this code for unshortening URLs in Python 3, but the code returns the URL as it is (shortened). What should I do to get it unshortened?
import requests
import http.client
import urllib.parse as urlparse

def unshortenurl(url):
    parsed = urlparse.urlparse(url)
    h = http.client.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path)
    response = h.getresponse()
    if response.status/100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    else:
        return url
In Python 3, response.status/100 == 3 would be True only for status code 300; for any other 3xx code it would be False, because / is true division. Use floor division instead, response.status//100 == 3, or some other way to test for redirection codes.
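To see the difference in the interpreter:

>>> 301 / 100   # true division in Python 3
3.01
>>> 301 // 100  # floor division keeps the 3xx family test working
3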
EDIT: It looks like you are using the code from the SO question posted by @Aybars, and there is a comment at the top of that snippet saying what to do in Python 3. Also, it would have been nice to mention the source of the code.
I have a problem with expanding short URLs, since not all of the ones I work with use the same redirection:
The idea is to expand shortened URLs. Here is an example of a short URL --> final URL; I need a function that takes the short URL and returns the expanded URL:
http://chollo.to/675za --> http://www.elcorteingles.es/limite-48-horas/equipaje/?sorting=priceAsc&aff_id=2118094&dclid=COvjy8Xrz9UCFeMi0wod4ZULuw
So far I have something semi-working, but it fails on examples like the one above:
import requests
import httplib
import urlparse

def unshorten_url(url):
    try:
        parsed = urlparse.urlparse(url)
        h = httplib.HTTPConnection(parsed.netloc)
        h.request('HEAD', parsed.path)
        response = h.getresponse()
        if response.status / 100 == 3 and response.getheader('Location'):
            url = requests.get(response.getheader('Location')).url
            print url
            return url
        else:
            url = requests.get(url).url
            print url
            return url
    except Exception as e:
        print(e)
The expected redirect does not appear to be well-formed according to requests:
import requests

response = requests.get('http://chollo.to/675za')
for resp in response.history:
    print(resp.status_code, resp.url)
print(response.url)
print(response.is_redirect)
Output:
301 http://chollo.to/675za
http://web.epartner.es/click.asp?ref=754218&site=14010&type=text&tnb=39&diurl=https%3A%2F%2Fad.doubleclick.net%2Fddm%2Fclk%2F302111021%3B129203261%3By%3Fhttp%3A%2F%2Fwww.elcorteingles.es%2Flimite-48-horas%2Fequipaje%2F%3Fsorting%3DpriceAsc%26aff_id%3D2118094
False
This is likely intentional on the part of epartner or doubleclick. For these types of nested URLs you would need an extra step, like:
from urllib.parse import unquote
# from urllib import unquote # python2
# if response.url.count('http') > 1:
url = 'http' + response.url.split('http')[-1]
unquote(url)
# http://www.elcorteingles.es/limite-48-horas/equipaje/?sorting=priceAsc&aff_id=2118094
Note: by doing this you might be avoiding intended ad revenues.
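Putting the redirect-following and the unquoting together, a rough helper might look like this (a sketch assembled from the snippets above; it is not robust against every shortener):

import requests
from urllib.parse import unquote

def expand_nested(url):
    # Follow the ordinary HTTP redirects first
    final = requests.get(url).url
    # If another URL is embedded in the query string, peel it out and decode it
    if final.count('http') > 1:
        final = unquote('http' + final.split('http')[-1])
    return final

print(expand_nested('http://chollo.to/675za'))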
I'm writing a web scraper, and basically what I'm working with, using requests and bs4, is a site that provides all content in the style https://downlaod.domain.com/xid_39428423_1, which then redirects you to the actual file. What I want is a command that fetches the redirect link before downloading the file, so I can check whether I've already downloaded said file. The current code snippet I have is this:
# Imports assumed from the surrounding script (not shown in the original snippet)
import os
import requests
from urlparse import urlsplit
from io import open as iopen

def download_file(file_url, s, thepath):
    if not os.path.isdir(thepath):
        os.makedirs(thepath)
    print 'getting header'
    i = s.head(file_url)
    urlpath = i.url
    # Use the last path segment of the final URL as the file name
    name = urlsplit(urlpath)[2].split('/')
    name = name[len(name)-1]
    if not os.path.exists(thepath + name):
        print urlpath
        i = s.get(urlpath)
        if i.status_code == requests.codes.ok:
            with iopen(thepath + name, 'wb') as file:
                file.write(i.content)
    else:
        return False
If I change the s.head to s.get it works, but then it downloads the file twice. Is there any way to get the redirected URL without downloading?
SOLVED
The final code looks like this, thanks!
def download_file(file_url, s, thepath):
    if not os.path.isdir(thepath):
        os.makedirs(thepath)
    print 'getting header'
    i = s.get(file_url, allow_redirects=False)
    if i.status_code == 302:
        urlpath = i.headers['location']
    else:
        urlpath = file_url
    name = urlsplit(urlpath)[2].split('/')
    name = name[len(name)-1]
    if not os.path.exists(thepath + name):
        print urlpath
        i = s.get(urlpath)
        if i.status_code == requests.codes.ok:
            with iopen(thepath + name, 'wb') as file:
                file.write(i.content)
    else:
        return False
You could use the allow_redirects flag and set it to False (see the documentation). That way the .get() will not follow the redirect, which allows you to inspect the response before retrieving the file itself.
In other words, instead of this:
i = s.head(file_url)
urlpath = i.url
You could write:
i = s.get(file_url, allow_redirects=False)
urlpath = i.headers['location']
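One caveat (my addition, not from the original answer): if the server does not redirect, there is no location header and the lookup raises a KeyError, so a guarded lookup is safer:

i = s.get(file_url, allow_redirects=False)
urlpath = i.headers.get('location', file_url)  # fall back to the original URL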
I am attempting to create a bot that fetches market links from Steam, but have run into a problem. I was able to return all the data from a single page, but when I attempt to get multiple pages it just gives me copies of the first page, even though I give it working links (e.g.: http://steamcommunity.com/market/search?q=appid%3A753#p1 and then http://steamcommunity.com/market/search?q=appid%3A753#p2). I have tested the links and they work in my browser. This is my code.
import urllib2
import random
import time

start_url = "http://steamcommunity.com/market/search?q=appid%3A753"
end_page = 3
urls = []

def get_raw(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read()

def get_market_urls(html):
    index = 0
    while index != -1:
        index = html.find("market_listing_row_link", index+25)
        beg = html.find("http", index)
        end = html.find('"', beg)
        print html[beg:end]
        urls.append(html[beg:end])

def go_to_page(page):
    return start_url+"#p"+str(page)

def wait(min, max):
    wait_t = random.randint(min, max)
    time.sleep(wait_t)

for i in range(end_page):
    url = go_to_page(i+1)
    raw = get_raw(url)
    get_market_urls(raw)
Your problem is that you've misunderstood what the URL says.
The number after the hash sign doesn't make it a different URL that can be fetched. That part is the fragment, and it is never sent to the server; on that particular page it tells the JavaScript which page of results to pull via AJAX.
Anyway, you should look at this URL instead: http://steamcommunity.com/market/search/render/?query=appid%3A753&start=00&count=10. You can play with the start=00&count=10 parameters to get the results you want, as in the sketch below.
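For example (a sketch; I'm assuming the results_html field of the returned JSON holds the listings markup):

import json
import urllib2

# The render endpoint returns JSON; page through it with start/count
render_url = "http://steamcommunity.com/market/search/render/?query=appid%3A753&start={0}&count=10"

for start in range(0, 30, 10):  # first three pages of 10 results each
    data = json.load(urllib2.urlopen(render_url.format(start)))
    get_market_urls(data["results_html"])  # reuse the parser from the question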
Enjoy.