I have seen this thread already - How can I unshorten a URL?
My issue with the accepted answer there (which uses the unshort.me API) is that I am focusing on unshortening YouTube links. Since unshort.me is used so heavily, almost 90% of the results come back as captchas, which I am unable to resolve.
So far I am stuck with using:
import urllib2

def unshorten_url(url):
    resolvedURL = urllib2.urlopen(url)
    print resolvedURL.url

    #t = Test()
    #c = pycurl.Curl()
    #c.setopt(c.URL, 'http://api.unshort.me/?r=%s&t=xml' % (url))
    #c.setopt(c.WRITEFUNCTION, t.body_callback)
    #c.perform()
    #c.close()
    #dom = xml.dom.minidom.parseString(t.contents)
    #resolvedURL = dom.getElementsByTagName("resolvedURL")[0].firstChild.nodeValue

    return resolvedURL.url
Note: everything in the comments is what I tried to do when using the unshort.me service which was returning captcha links.
Does anyone know of a more efficient way to complete this operation without using open (since it is a waste of bandwidth)?
A one-line function using the requests library, and yes, it handles chains of short URLs.
import requests

def unshorten_url(url):
    return requests.head(url, allow_redirects=True).url
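For example (the short URL below is just a placeholder; use any shortened link):

print(unshorten_url('http://bit.ly/cXEInp'))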
Use the best rated answer (not the accepted answer) in that question:
# This is for Py2k. For Py3k, use http.client and urllib.parse instead, and
# use // instead of / for the division
import httplib
import urlparse

def unshorten_url(url):
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    resource = parsed.path
    if parsed.query != "":
        resource += "?" + parsed.query
    h.request('HEAD', resource)
    response = h.getresponse()
    if response.status/100 == 3 and response.getheader('Location'):
        return unshorten_url(response.getheader('Location'))  # changed to process chains of short urls
    else:
        return url
You DO have to open it, otherwise you won't know what URL it will redirect to. As Greg put it:
A short link is a key into somebody else's database; you can't expand the link without querying the database
Now to your question.
Does anyone know of a more efficient way to complete this operation
without using open (since it is a waste of bandwidth)?
The more efficient way is not to close the connection but to keep it open in the background, using HTTP's Connection: keep-alive.
After a small test, unshorten.me seems to take the HEAD method into account and respond with a redirect to itself:
> telnet unshorten.me 80
Trying 64.202.189.170...
Connected to unshorten.me.
Escape character is '^]'.
HEAD http://unshort.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp HTTP/1.1
Host: unshorten.me
HTTP/1.1 301 Moved Permanently
Date: Mon, 22 Aug 2011 20:42:46 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Location: http://resolves.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp
Cache-Control: private
Content-Length: 0
So if you use the HEAD HTTP method, instead of GET, you will actually end up doing the same work twice.
Instead, you should keep the connection alive, which will save you only a little bandwidth, but what it will certainly save is the latency of establishing a new connection every time. Establishing a TCP/IP connection is expensive.
You should get away with a number of kept-alive connections to the unshorten service equal to the number of concurrent connections your own service receives.
You could manage these connections in a pool. That's the closest you can get, short of tweaking your kernel's TCP/IP stack.
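For instance, with the requests library a Session object keeps connections alive and pools them (via urllib3) under the hood, so repeated lookups against the same host skip the TCP handshake. A rough sketch, not tied to any particular shortening service:

import requests

session = requests.Session()  # reuses TCP connections from an internal pool

def unshorten(url):
    # HEAD keeps the transfer small; allow_redirects follows the whole chain
    return session.head(url, allow_redirects=True).url

for short in ['http://bit.ly/cXEInp', '<another short url>']:
    print(unshorten(short))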
Here is source code that takes into account almost all of the useful corner cases:
set a custom timeout.
set a custom user agent.
check whether we have to use an HTTP or HTTPS connection.
resolve the input URL recursively and prevent ending up in a loop.
The source code is on GitHub: https://github.com/amirkrifa/UnShortenUrl
Comments are welcome...
import logging
import traceback
import urlparse
import httplib

logging.basicConfig(level=logging.DEBUG)
TIMEOUT = 10

class UnShortenUrl:
    def process(self, url, previous_url=None):
        logging.info('Init url: %s' % url)
        try:
            parsed = urlparse.urlparse(url)
            if parsed.scheme == 'https':
                h = httplib.HTTPSConnection(parsed.netloc, timeout=TIMEOUT)
            else:
                h = httplib.HTTPConnection(parsed.netloc, timeout=TIMEOUT)
            resource = parsed.path
            if parsed.query != "":
                resource += "?" + parsed.query
            try:
                h.request('HEAD',
                          resource,
                          headers={'User-Agent': 'curl/7.38.0'})
                response = h.getresponse()
            except Exception:
                traceback.print_exc()
                return url
            logging.info('Response status: %d' % response.status)
            if response.status / 100 == 3 and response.getheader('Location'):
                red_url = response.getheader('Location')
                logging.info('Red, previous: %s, %s' % (red_url, previous_url))
                if red_url == previous_url:
                    return red_url
                return self.process(red_url, previous_url=url)
            else:
                return url
        except Exception:
            traceback.print_exc()
            return None
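A minimal usage sketch for the class above (the short URL is just a placeholder):

if __name__ == '__main__':
    unshortener = UnShortenUrl()
    print unshortener.process('http://bit.ly/cXEInp')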
import requests
short_url = "<your short url goes here>"
long_url = requests.get(short_url).url
print(long_url)
Related
I am using a private rotating proxy provided by https://proxy.webshare.io/proxy/rotating? in which each request to the rotating proxy receives a new IP address. When I am using
requests.get('https://httpbin.org/get', headers=headers, proxies=get_proxy())
it returns a new IP each time I make a request. But when using
session = requests.Session()
session.headers = headers
session.proxies = get_proxy()
session.get('https://httpbin.org/get')
it returns the same IP each time I make a request.
How does the Session object behave differently from the requests.get() function with regard to proxies?
Session uses previously set up variables/values for each subsequent request, like Cookies. If you want to change the proxy for each request in the session, then use Prepared Requests to set it each time or just put it in a function:
import requests

def send(session, url):
    return session.get(url, proxies=get_proxy())

sess = requests.Session()
sess.headers = headers
resp = send(sess, 'https://httpbin.org/get')
print(resp.status_code)
But if you're trying to hide your origin IP for scraping or something, you probably don't want to persist cookies, etc. so you shouldn't use sessions.
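To illustrate the prepared-request route mentioned above, here is a sketch that reuses the question's get_proxy() and headers; Session.send() accepts a per-call proxies argument, so the proxy applies to that request only:

import requests

sess = requests.Session()
sess.headers = headers  # headers dict from the question

req = requests.Request('GET', 'https://httpbin.org/get')
prepared = sess.prepare_request(req)
# proxies passed to send() apply to this request only, not to the whole session
resp = sess.send(prepared, proxies=get_proxy())
print(resp.json())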
The following code works; it takes a proxylistfile.txt file and checks every proxy:
from requests import *
import bs4
import sys

if len(sys.argv) < 2:
    print('Usage: ./testproxy.py <proxylistfile.txt>')
    sys.exit()

ifco = 'http://ifconfig.co'
PROXIES_FILE = sys.argv[1]
proxy = dict()

with open(PROXIES_FILE) as file:
    for line in file:
        if line[0] == '#' or line == "\n":
            continue
        line_parts = line.replace('\n', '').split(':')
        proxy['http'] = f'{line_parts[0]}://{line_parts[1]}:{line_parts[2]}'
        try:
            i = get(ifco, proxies=proxy, timeout=11)
            print(f"{proxy['http']} - successful - IP ---> ", end='')
            zu = bs4.BeautifulSoup(i.text, 'html.parser')
            testo = zu.findAll('p', text=True)[0].get_text()
            print(testo)
        except:
            print(f"{proxy['http']} - unsuccessful")
It connects to the ifconfig.co site and returns the real IP to check whether the proxy works.
The output will be something like:
http://proxy:port - successful - IP ---> your.real.ip
The input file format should be like:
http:1.1.1.1:3128
I finally switched to another rotating proxy provider (https://www.proxyegg.com) and the issue has been resolved.
In python, how would I check if a url ending in .jpg exists?
ex:
http://www.fakedomain.com/fakeImage.jpg
thanks
The code below is equivalent to tikiboy's answer, but uses the higher-level and easy-to-use requests library.
import requests

def exists(path):
    r = requests.head(path)
    return r.status_code == requests.codes.ok

print exists('http://www.fakedomain.com/fakeImage.jpg')
The requests.codes.ok equals 200, so you can substitute the exact status code if you wish.
requests.head may throw an exception if the server doesn't respond, so you might want to add a try-except construct.
Also, if you want to include codes 301 and 302, consider code 303 too, especially if you dereference URIs that denote resources in Linked Data. A URI may represent a person, but you can't download a person, so the server will redirect you to a page that describes this person using a 303 redirect.
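For instance, you could disable automatic redirects and accept those codes explicitly (a sketch; adjust the accepted codes to whatever counts as "exists" for you):

import requests

def exists(path):
    r = requests.head(path, allow_redirects=False)
    # 200 = found here; 301/302/303 = exists, but lives (or is described) elsewhere
    return r.status_code in (200, 301, 302, 303)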
>>> import httplib
>>>
>>> def exists(site, path):
... conn = httplib.HTTPConnection(site)
... conn.request('HEAD', path)
... response = conn.getresponse()
... conn.close()
... return response.status == 200
...
>>> exists('http://www.fakedomain.com', '/fakeImage.jpg')
False
If the status is anything other than a 200, the resource doesn't exist at the URL. This doesn't mean that it's gone altogether. If the server returns a 301 or 302, this means that the resource still exists, but at a different URL. To alter the function to handle this case, the status check line just needs to be changed to return response.status in (200, 301, 302).
Thanks for all the responses everyone, I ended up using the following:
try:
    f = urllib2.urlopen(urllib2.Request(url))
    deadLinkFound = False
except:
    deadLinkFound = True
It looks like http://www.fakedomain.com/fakeImage.jpg is automatically redirected to http://www.fakedomain.com/index.html without any error.
Redirects for 301 and 302 responses are followed automatically without giving any response back to the user.
Please take a look at HTTPRedirectHandler; you might need to subclass it to handle that.
Here is the one sample from Dive Into Python:
http://diveintopython3.ep.io/http-web-services.html#redirects
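As a rough sketch (Python 2, not tested against that domain), a subclassed handler can turn redirects into errors so a redirected image URL is not silently reported as existing:

import urllib2

class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        # Raise instead of following, so the caller can tell "moved" apart from "exists here"
        raise urllib2.HTTPError(req.get_full_url(), code, msg, headers, fp)
    http_error_302 = http_error_303 = http_error_307 = http_error_301

opener = urllib2.build_opener(NoRedirectHandler())
try:
    opener.open('http://www.fakedomain.com/fakeImage.jpg')
    print 'exists at this exact URL'
except urllib2.HTTPError as e:
    print 'missing or redirected, status %d' % e.code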
There are problems with the previous answers when the file is on an FTP server (ftp://url.com/file); the following code works whether the file is served over FTP, HTTP or HTTPS:
import urllib2

def file_exists(url):
    request = urllib2.Request(url)
    request.get_method = lambda: 'HEAD'
    try:
        response = urllib2.urlopen(request)
        return True
    except:
        return False
Try it with mechanize:
import mechanize

br = mechanize.Browser()
br.set_handle_redirect(False)
try:
    br.open_novisit('http://www.fakedomain.com/fakeImage.jpg')
    print 'OK'
except:
    print 'KO'
This might be good enough to see if a url to a file exists.
import urllib

if urllib.urlopen('http://www.fakedomain.com/fakeImage.jpg').code == 200:
    print 'File exists'
In Python 3.6.5:

import http.client

def exists(site, path):
    connection = http.client.HTTPConnection(site)
    connection.request('HEAD', path)
    response = connection.getresponse()
    connection.close()
    return response.status == 200

exists("www.fakedomain.com", "/fakeImage.jpg")
In Python 3, the module httplib has been renamed to http.client.
And you need to remove the http:// and https:// from your URL, because httplib treats everything after the : as a port number, and the port number must be numeric.
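For example, a sketch that uses urllib.parse to strip the scheme before handing the host to http.client (Python 3):

from urllib.parse import urlparse
import http.client

def exists(url):
    parsed = urlparse(url)
    conn_class = (http.client.HTTPSConnection if parsed.scheme == 'https'
                  else http.client.HTTPConnection)
    conn = conn_class(parsed.netloc)          # netloc carries no scheme, so no bogus port
    conn.request('HEAD', parsed.path or '/')
    status = conn.getresponse().status
    conn.close()
    return status == 200

print(exists('http://www.fakedomain.com/fakeImage.jpg'))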
Python3
import requests

def url_exists(url):
    """Check whether the resource exists."""
    if not url:
        raise ValueError("url is required")
    try:
        resp = requests.head(url)
        return resp.status_code == 200
    except Exception:
        return False
The answer from @z3moon was good, but I think it is for Python 2.x. For Python 3.x, you may want to add .request to the module call.
import urllib.request

def check_valid_URLs(url) -> bool:
    try:
        if urllib.request.urlopen(url).code == 200:
            return True
        else:
            return False
    except:
        return False
I think you can try sending an HTTP request to the URL and reading the response. If no exception is caught, it probably exists.
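A minimal sketch of that idea with the standard library (Python 3; a plain GET, so it downloads the body):

import urllib.request
import urllib.error

def url_exists(url):
    try:
        urllib.request.urlopen(url, timeout=10)
        return True
    except urllib.error.URLError:
        # also covers HTTPError (4xx/5xx responses), since it is a subclass
        return False

print(url_exists('http://www.fakedomain.com/fakeImage.jpg'))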
We use a custom scraper that has to take a separate website for each language (this is an architecture limitation), like site1.co.uk, site1.es, site1.de, etc.
But we need to parse a website with many languages, separated by URL path, like site2.com/en, site2.com/de, site2.com/es and so on.
I thought about MITMProxy: I could redirect all requests this way:
en.site2.com/* --> site2.com/en
de.site2.com/* --> site2.com/de
...
I have written a small script which simply takes URLs and rewrites them:
class MyMaster(flow.FlowMaster):
    def handle_request(self, r):
        url = r.get_url()
        # replace URLs
        if 'blabla' in url:
            r.set_url(url.replace('something', 'another'))
But the target host generates a 301 redirect in the response from the webserver - 'the page has been moved here' - with a link to site2.com/en.
It worked when I played with URL rewriting, i.e. site2.com/en --> site2.com/de.
But for different hosts (subdomain and the root domain, to be precise), it does not work.
I tried to replace the Host header in the handle_request method from above:
for key in r.headers.keys():
    if key.lower() == 'host':
        r.headers[key] = ['site2.com']
I also tried to replace the Referer - none of that helped.
How can I finally spoof that request from the subdomain to the main domain? If it generates an HTTP(S) client warning, that's OK, since we need this for the scraper (where the warnings can be turned off), not a real browser.
Thanks!
You need to replace the content of the response and craft the headers with just a few fields.
Open a new connection to the redirected URL and craft your response:
def handle_request(self, flow):
    newUrl = <new-url>
    retryCount = 3
    newResponse = None
    while True:
        try:
            newResponse = requests.get(newUrl)  # import requests
        except:
            if retryCount == 0:
                print 'Cannot reach new url ' + newUrl
                traceback.print_exc()  # import traceback
                return
            retryCount -= 1
            continue
        break

    responseHeaders = Headers()  # from netlib.http import Headers
    if 'Date' in newResponse.headers:
        responseHeaders['Date'] = str(newResponse.headers['Date'])
    if 'Connection' in newResponse.headers:
        responseHeaders['Connection'] = str(newResponse.headers['Connection'])
    if 'Content-Type' in newResponse.headers:
        responseHeaders['Content-Type'] = str(newResponse.headers['Content-Type'])
    if 'Content-Length' in newResponse.headers:
        responseHeaders['Content-Length'] = str(newResponse.headers['Content-Length'])
    if 'Content-Encoding' in newResponse.headers:
        responseHeaders['Content-Encoding'] = str(newResponse.headers['Content-Encoding'])

    response = HTTPResponse(  # from libmproxy.models import HTTPResponse
        http_version='HTTP/1.1',
        status_code=200,
        reason='OK',
        headers=responseHeaders,
        content=newResponse.content)
    flow.reply(response)
I'm trying to make a universal script in Python that can be used by anybody to import/export all sorts of information from/to Work Etc CRM platform. It has all the documentation here: http://admin.worketc.com/xml.
However, I am now a bit stuck. Authentication works, I can call different API methods, but only the ones without parameters. I am new to Python and that's why I can't figure out how to pass the parameters onto that specific method in the API. Specifically I need to export all time sheets. I'm trying to call this method specifically: http://admin.worketc.com/xml?op=GetDraftTimesheets. For obvious reasons I cannot disclose the login information so it might be a bit hard to test for you.
The code itself:
import xml.etree.ElementTree as ET
import urllib2
import sys

email = 'email#domain.co.uk'
password = 'pass'

#service = 'GetEmployee?EntityID=1658'
#service = 'GetEntryID?EntryID=23354'
#service = ['GetAllCurrenciesWebSafe']
#service = ['GetEntryID', 'EntryID=23354']
service = ['GetDraftTimesheets', '2005-08-15T15:52:01+00:00', '2014-08-15T15:52:01+00:00']


class workEtcUniversal():

    sessionkey = None

    def __init__(self, url):
        if not "http://" in url and not "https://" in url:
            url = "http://%s" % url
            self.base_url = url
        else:
            self.base_url = url

    def authenticate(self, user, password):
        try:
            loginurl = self.base_url + email + '&pass=' + password
            req = urllib2.Request(loginurl)
            response = urllib2.urlopen(req)
            the_page = response.read()
            root = ET.fromstring(the_page)
            sessionkey = root[1].text
            print 'Authentication successful!'
            try:
                f = self.service(sessionkey, service)
            except RuntimeError:
                print 'Did not perform function!'
        except RuntimeError:
            print 'Error logging in or calling the service method!'

    def service(self, sessionkey, service):
        try:
            if len(service) < 2:
                retrieveurl = 'https://domain.worketc.com/xml/' + service[0] + '?VeetroSession=' + sessionkey
            else:
                retrieveurl = 'https://domain.worketc.com/xml/' + service[0,1,2] + '?VeetroSession=' + sessionkey
        except TypeError as err:
            print 'Type Error, which means arguments are wrong (or wrong implementation)'
            print 'Quitting..'
            sys.exit()
        try:
            responsefile = urllib2.urlopen(retrieveurl)
        except urllib2.HTTPError as err:
            if err.code == 500:
                print 'Internal Server Error: Permission Denied or Object (Service) Does Not Exist'
                print 'Quitting..'
                sys.exit()
            elif err.code == 404:
                print 'Wrong URL!'
                print 'Quitting..'
                sys.exit()
            else:
                raise
        try:
            f = open("ExportFolder/worketcdata.xml", 'wb')
            for line in responsefile:
                f.write(line)
            f.close()
            print 'File has been saved into: ExportFolder'
        except (RuntimeError, UnboundLocalError):
            print 'Could not write into the file'


client = workEtcUniversal('https://domain.worketc.com/xml/AuthenticateWebSafe?email=')
client.authenticate(email, password)
Writing code that consumes an API requires resolving a few questions:
what methods the API offers (get the list of their names)
what a request to such a method looks like (find out the URL, the HTTP method to use, the requirements for the body if one is used, and what headers are expected)
how to build up all the parts to make the request
What methods are available
http://admin.worketc.com/xml lists many of them
What a request looks like
GetDraftTimesheets is described here: http://admin.worketc.com/xml?op=GetDraftTimesheets
and it expects you to create the following HTTP request:
POST /xml HTTP/1.1
Host: admin.worketc.com
Content-Type: text/xml; charset=utf-8
Content-Length: length
SOAPAction: "http://schema.veetro.com/GetDraftTimesheets"
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<GetDraftTimesheets xmlns="http://schema.veetro.com">
<arg>
<FromUtc>dateTime</FromUtc>
<ToUtc>dateTime</ToUtc>
</arg>
</GetDraftTimesheets>
</soap:Body>
</soap:Envelope>
Building up the request
The biggest task is to build a properly shaped XML document as shown above, with the elements FromUtc and ToUtc filled with proper values. I guess the values shall be in ISO datetime format; this you shall find out yourself.
You should be able to build such an XML document with some Python library; I would use lxml.
Note that the XML document uses namespaces; you have to handle them properly.
Making the POST request with all the headers shall be easy. The library you use to make HTTP requests shall fill in the Content-Length value properly, but this is mostly done automatically.
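A rough sketch of both steps with lxml and requests; the element names and namespace come from the sample request above, while the date values and the session/authentication handling are assumptions you will need to adapt:

import requests
from lxml import etree

SOAP_NS = 'http://schemas.xmlsoap.org/soap/envelope/'
VEETRO_NS = 'http://schema.veetro.com'

# Build the SOAP envelope with proper namespaces
envelope = etree.Element('{%s}Envelope' % SOAP_NS, nsmap={'soap': SOAP_NS})
body = etree.SubElement(envelope, '{%s}Body' % SOAP_NS)
method = etree.SubElement(body, '{%s}GetDraftTimesheets' % VEETRO_NS, nsmap={None: VEETRO_NS})
arg = etree.SubElement(method, '{%s}arg' % VEETRO_NS)
etree.SubElement(arg, '{%s}FromUtc' % VEETRO_NS).text = '2005-08-15T15:52:01'  # assumed ISO format
etree.SubElement(arg, '{%s}ToUtc' % VEETRO_NS).text = '2014-08-15T15:52:01'    # assumed ISO format

payload = etree.tostring(envelope, xml_declaration=True, encoding='utf-8')
headers = {
    'Content-Type': 'text/xml; charset=utf-8',
    'SOAPAction': '"http://schema.veetro.com/GetDraftTimesheets"',
}
# Content-Length is filled in automatically by requests
resp = requests.post('http://admin.worketc.com/xml', data=payload, headers=headers)
print(resp.status_code, resp.text[:200])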
Veetro provides many alternative methods
E.g. for http://admin.worketc.com/xml?op=FindArticlesWebSafe there is a set of different methods for the same service:
SOAP 1.1
SOAP 1.2
HTTP GET
HTTP POST
Depending on your preferences, pick the one which fits your needs.
The simplest is mostly HTTP GET.
For HTTP requests, I would recommend using requests, which is really easy to use; if you go through the tutorial, you will understand what I mean.
I am trying to unshorten a lot of URLs which I have in a urlSet. The following code works most of the time, but sometimes it takes a very long time to finish. For example, I have 2950 URLs in urlSet; stderr tells me that 2900 are done, but getUrlMapping does not finish.
import sys
import grequests

def getUrlMapping(urlSet):
    # get the url mapping
    urlMapping = {}
    #rs = (grequests.get(u) for u in urlSet)
    rs = (grequests.head(u) for u in urlSet)
    res = grequests.imap(rs, size=100)
    counter = 0
    for x in res:
        counter += 1
        if counter % 50 == 0:
            sys.stderr.write('Doing %d url_mapping length %d \n' % (counter, len(urlMapping)))
        urlMapping[getOriginalUrl(x)] = getGoalUrl(x)
    return urlMapping

def getGoalUrl(resp):
    url = ''
    try:
        url = resp.url
    except:
        url = 'NULL'
    return url

def getOriginalUrl(resp):
    url = ''
    try:
        url = resp.history[0].url
    except IndexError:
        url = resp.url
    except:
        url = 'NULL'
    return url
Probably it won't help you, as a long time has passed, but still...
I was having some issues with Requests, similar to the ones you are having. To me the problem was that Requests took ages to download some pages, but using any other software (browsers, curl, wget, python's urllib) everything worked fine...
After a LOT of time wasted, I noticed that the server was sending some invalid headers; for example, in one of the "slow" pages, after Content-type: text/html it began to send headers in the form Header-name : header-value (notice the space before the colon). This somehow breaks the email.header functionality that Requests uses to parse HTTP headers, so the Transfer-encoding: chunked header wasn't being parsed.
Long story short: manually setting the chunked property of the raw response to True before asking for the content solved the issue. For example:
response = requests.get('http://my-slow-url')
print(response.text)
took ages but
response = requests.get('http://my-slow-url')
response.raw.chunked = True
print(response.text)
worked great!