I am trying to unshorten a large number of URLs which I have in a set called urlSet. The following code works most of the time, but sometimes it takes a very long time to finish. For example, I have 2950 URLs in urlSet; stderr tells me that 2900 are done, but getUrlMapping never finishes.
def getUrlMapping(urlSet):
    # get the url mapping
    urlMapping = {}
    #rs = (grequests.get(u) for u in urlSet)
    rs = (grequests.head(u) for u in urlSet)
    res = grequests.imap(rs, size = 100)
    counter = 0
    for x in res:
        counter += 1
        if counter % 50 == 0:
            sys.stderr.write('Doing %d url_mapping length %d \n' % (counter, len(urlMapping)))
        urlMapping[ getOriginalUrl(x) ] = getGoalUrl(x)
    return urlMapping

def getGoalUrl(resp):
    url = ''
    try:
        url = resp.url
    except:
        url = 'NULL'
    return url

def getOriginalUrl(resp):
    url = ''
    try:
        url = resp.history[0].url
    except IndexError:
        url = resp.url
    except:
        url = 'NULL'
    return url
Probably it won't help you since a long time has passed, but still...
I was having some issues with Requests similar to the ones you are having. For me the problem was that Requests took ages to download some pages, while any other software (browsers, curl, wget, Python's urllib) worked fine...
After a LOT of wasted time, I noticed that the server was sending some invalid headers. For example, in one of the "slow" pages, after Content-type: text/html it began to send headers in the form Header-name : header-value (notice the space before the colon). This somehow breaks the email.header functionality that Requests uses to parse HTTP headers, so the Transfer-encoding: chunked header wasn't being parsed.
Long story short: manually setting the chunked property of the response's raw object to True before asking for the content solved the issue. For example:
response = requests.get('http://my-slow-url')
print(response.text)
took ages but
response = requests.get('http://my-slow-url')
response.raw.chunked = True
print(response.text)
worked great!
How to measure redirection latency using the requests library?
By redirect latency I mean the time between the 301 response and the final 200, or the time between two consecutive redirects, similar to what a Chrome extension or the browser's network panel would show.
import requests
r = 'http://httpbin.org/redirect/3'
r = requests.head(r, allow_redirects=True, stream=True)
r.elapsed
r.elapsed is not what I need; it only shows the time between sending the request and the arrival of its response.
Option 1
Using the requests library, you would have to set the allow_redirects parameter to False and write a loop in which you look up the Location response header (which indicates the URL a page redirects to), perform a new request to that URL, and finally add up the total elapsed time.
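For reference, a minimal sketch of that manual loop with requests might look as follows (the httpbin URL is only a test endpoint, and urljoin is used because the Location header may be relative):
import requests
from urllib.parse import urljoin

url = 'http://httpbin.org/redirect/3'
total = 0.0

with requests.Session() as session:
    while True:
        r = session.get(url, allow_redirects=False)
        total += r.elapsed.total_seconds()
        print(r.url, r.elapsed)
        if not r.is_redirect:
            break
        # The Location header may be relative, so resolve it against the current URL
        url = urljoin(r.url, r.headers['Location'])

print(f'Total time elapsed: {total} s')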
However, you may find it easier to do that with the httpx library, which is very similar to Python requests but has more features. You can set the follow_redirects parameter to False (which is the default value anyway) and use the .next_request attribute of the Response object to get the redirection URL as an already constructed Request object. As described earlier, you can then have a loop that sends each request and measures its response time (= transport latency + processing time) separately. The response.elapsed attribute returns a timedelta object with the time elapsed from sending the request to the arrival of the response. By adding all the response times together, you can measure the total time elapsed. Example:
import httpx

url = 'http://httpbin.org/redirect/3'

with httpx.Client() as client:
    r = client.get(url, follow_redirects=False)
    print(r.url, r.elapsed, '', sep='\n')
    total = r.elapsed.total_seconds()

    while 300 < r.status_code < 400:
        r = client.send(r.next_request)
        print(r.url, r.elapsed, '', sep='\n')
        total += r.elapsed.total_seconds()

print(f'Total time elapsed: {total} s')
Option 2
Set the follow_redirects parameter to True and use the .history attribute of the Response object to get a list of responses that led to the final URL. The .history property contains a list of any redirect responses that were followed, in the order in which they were made. You can measure the elapsed time for each request, as well as the total elapsed time, as demonstrated in Option 1 above. Example:
import httpx

url = 'http://httpbin.org/redirect/3'

with httpx.Client() as client:
    r = client.get(url, follow_redirects=True)
    total = 0

    if r.history:
        for resp in r.history:
            print(resp.url, resp.elapsed, '', sep='\n')
            total += resp.elapsed.total_seconds()

    print(r.url, r.elapsed, '', sep='\n')
    total += r.elapsed.total_seconds()

print(f'Total time elapsed: {total} s')
In Python requests (instead of httpx), the above approach would be as follows:
import requests

url = 'http://httpbin.org/redirect/3'

with requests.Session() as session:
    r = session.get(url, allow_redirects=True)
    total = 0

    if r.history:
        for resp in r.history:
            print(resp.url, resp.elapsed, '', sep='\n')
            total += resp.elapsed.total_seconds()

    print(r.url, r.elapsed, '', sep='\n')
    total += r.elapsed.total_seconds()

print(f'Total time elapsed: {total} s')
Note
If you would like to ignore the body of the response (as you might not need it, and hence you wouldn't want it loaded into your RAM, especially when using Option 2, where every response is saved to history), you can send a HEAD instead of a GET request. Examples are given below.
Using httpx:
r = client.head(url, ...
Using requests:
r = session.head(url, ...
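For instance, a complete (hedged) HEAD-based variant of Option 2 with httpx, using the same httpbin test URL as above, could look like this:
import httpx

url = 'http://httpbin.org/redirect/3'

with httpx.Client() as client:
    # HEAD: the redirect chain and timings are returned, but no bodies are downloaded
    r = client.head(url, follow_redirects=True)
    total = sum(resp.elapsed.total_seconds() for resp in r.history) + r.elapsed.total_seconds()

print(f'Total time elapsed: {total} s')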
I've been using requests.Session() to make web requests with authentication. Maybe 70% of the time I'll get a status_code of 200, but I also sporadically get 401.
Since I'm using a session - I'm absolutely positive that the credentials are correct - given that the same exact request when repeated may return 200.
Some further details:
I'm working with the SharePoint REST API
I'm using NTLM Authentication
To circumvent the problem, I've tried writing a loop that sleeps for a few seconds and retries the request. The odd thing is that I haven't seen this actually recover: if the first request fails, all subsequent retries fail too. But if I just run it again, the request may succeed on the first try.
Please note that I've already reviewed this question, but the suggestion is to use requests.Session(), which I'm already doing and still receiving 401s.
Here's some code to demonstrate what I've tried so far.
import time

import requests
from requests_ntlm import HttpNtlmAuth
from urllib.parse import quote

# Establish requests session
s = requests.Session()
s.auth = HttpNtlmAuth(username, password)

# Update the request header to request JSON formatted output
s.headers.update({'Content-Type': 'application/json; odata=verbose',
                  'accept': 'application/json;odata=verbose'})

def RetryLoop(req, max_tries = 5):
    ''' Takes in a request object and will retry the request
        upon failure up to the specified number of maximum
        retries.

        Used because error codes occasionally surface even though the
        REST API call is formatted correctly. Exception returns status code
        and text. Success returns request object.

        Default max_tries = 5
    '''
    # Call fails sometimes - allow 5 retries
    counter = 0
    # Initialize loop
    while True:
        # Hit the URL
        r = req
        # Return request object on success
        if r.status_code == 200:
            return r
        # If limit reached then raise exception
        counter += 1
        if counter == max_tries:
            print(f"Failed to connect. \nError code = {r.status_code}\nError text: {r.text}")
        # Message for failed retry
        print(f'Failed request. Error code: {r.status_code}. Trying again...')
        # Spacing out the requests in case of a connection problem
        time.sleep(5)

r = RetryLoop(s.get("https://my_url.com"))
I've additionally tried creating a new session within the retry loop, but that hasn't seemed to help either. I thought 5 seconds of sleep should be sufficient if it's a temporary block from the site, because I've retried manually in much less time and gotten the expected 200. I would expect to see a failure or two and then a success.
Is there an underlying problem that I'm missing? And is there a more proper way to re-attempt the request when I get a 401?
EDIT: @Swadeep pointed out the issue - by passing the already-executed request into the function, the request is only made once. Updated code that works properly:
def RetryLoop(req, max_tries = 5):
    ''' Takes in a request object and will retry the request
        upon failure up to the specified number of maximum
        retries.

        Used because error codes occasionally surface even though the
        REST API call is formatted correctly. Exception returns status code
        and text. Success returns request object.

        Default max_tries = 5
    '''
    # Call fails sometimes - allow 5 retries
    counter = 0
    # Initialize loop
    while True:
        # Return request object on success
        if req.status_code == 200:
            return req
        # If limit reached then raise exception
        counter += 1
        if counter == max_tries:
            print(f"Failed to connect. \nError code = {req.status_code}\nError text: {req.text}")
        # Message for failed retry
        print(f'Failed request. Error code: {req.status_code}. Trying again...')
        # Spacing out the requests in case of a connection problem
        time.sleep(1)
        req = s.get(req.url)
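An equivalent way to express the same fix, shown here only as a sketch, is to pass a zero-argument callable so the helper re-sends the request on every attempt (retry_loop and its delay value are illustrative, not from the original post):
import time

def retry_loop(make_request, max_tries=5, delay=5):
    '''Call make_request() until it returns 200 or max_tries is reached.'''
    for attempt in range(1, max_tries + 1):
        r = make_request()
        if r.status_code == 200:
            return r
        print(f'Attempt {attempt} failed with {r.status_code}. Trying again...')
        time.sleep(delay)
    raise RuntimeError(f'Failed after {max_tries} tries: {r.status_code}')

# The lambda defers the call, so a fresh request really is sent on each retry
r = retry_loop(lambda: s.get("https://my_url.com"))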
This is what I propose.
import time

import requests
from requests_ntlm import HttpNtlmAuth
from urllib.parse import quote

# Establish requests session
s = requests.Session()
s.auth = HttpNtlmAuth(username, password)

# Update the request header to request JSON formatted output
s.headers.update({'Content-Type': 'application/json; odata=verbose', 'accept': 'application/json;odata=verbose'})

def RetryLoop(s, max_tries = 5):
    '''Takes in a session object and will retry the request
       upon failure up to the specified number of maximum
       retries.

       Used because error codes occasionally surface even though the
       REST API call is formatted correctly. Exception returns status code
       and text. Success returns request object.

       Default max_tries = 5
    '''
    # Call fails sometimes - allow 5 retries
    counter = 0
    # Initialize loop
    while True:
        # Hit the URL
        r = s.get("https://my_url.com")
        # Return request object on success
        if r.status_code == 200:
            return r
        # If limit reached then raise exception
        counter += 1
        if counter == max_tries:
            print(f"Failed to connect. \nError code = {r.status_code}\nError text: {r.text}")
        # Message for failed retry
        print(f'Failed request. Error code: {r.status_code}. Trying again...')
        # Spacing out the requests in case of a connection problem
        time.sleep(5)

r = RetryLoop(s)
We use a custom scraper that has to use a separate website per language (this is an architecture limitation), like site1.co.uk, site1.es, site1.de, etc.
But we need to parse a website with many languages, separated by url - like site2.com/en, site2.com/de, site2.com/es and so on.
I thought about MITMProxy: I could redirect all requests this way:
en.site2.com/* --> site2.com/en
de.site2.com/* --> site2.com/de
...
I have written a small script which simply takes URLs and rewrites them:
class MyMaster(flow.FlowMaster):
    def handle_request(self, r):
        url = r.get_url()
        # replace URLs
        if 'blabla' in url:
            r.set_url(url.replace('something', 'another'))
But the target host responds with a 301 redirect from the webserver - 'the page has been moved here' - pointing to site2.com/en.
It worked when I played with URL rewriting, i.e. site2.com/en --> site2.com/de.
But for different hosts (subdomain and the root domain, to be precise), it does not work.
I tried to replace the Host header in the handle_request method from above:
for key in r.headers.keys():
    if key.lower() == 'host':
        r.headers[key] = ['site2.com']
I also tried replacing the Referer header - none of that helped.
How can I finally spoof that request from the subdomain to the main domain? If it triggers an HTTP(S) client warning that's OK, since we need this for the scraper (where such warnings can be turned off), not for a real browser.
Thanks!
You need to replace the content of the response and craft the headers with just a few fields.
Open a new connection to the redirected URL and craft your response:
def handle_request(self, flow):
    newUrl = <new-url>
    retryCount = 3
    newResponse = None
    while True:
        try:
            newResponse = requests.get(newUrl)  # import requests
        except:
            if retryCount == 0:
                print 'Cannot reach new url ' + newUrl
                traceback.print_exc()  # import traceback
                return
            retryCount -= 1
            continue
        break

    responseHeaders = Headers()  # from netlib.http import Headers
    if 'Date' in newResponse.headers:
        responseHeaders['Date'] = str(newResponse.headers['Date'])
    if 'Connection' in newResponse.headers:
        responseHeaders['Connection'] = str(newResponse.headers['Connection'])
    if 'Content-Type' in newResponse.headers:
        responseHeaders['Content-Type'] = str(newResponse.headers['Content-Type'])
    if 'Content-Length' in newResponse.headers:
        responseHeaders['Content-Length'] = str(newResponse.headers['Content-Length'])
    if 'Content-Encoding' in newResponse.headers:
        responseHeaders['Content-Encoding'] = str(newResponse.headers['Content-Encoding'])

    response = HTTPResponse(  # from libmproxy.models import HTTPResponse
        http_version='HTTP/1.1',
        status_code=200,
        reason='OK',
        headers=responseHeaders,
        content=newResponse.content)
    flow.reply(response)
I am using gevent to download some html pages.
Some websites are way too slow, and some stop serving requests after a period of time. That is why I had to limit the total time for a group of requests. For that I use gevent's Timeout.
timeout = Timeout(10)
timeout.start()

def downloadSite():
    # code to download site's urls one by one
    url1 = downloadUrl()
    url2 = downloadUrl()
    url3 = downloadUrl()

try:
    gevent.spawn(downloadSite).join()
except Timeout:
    print 'Lost state here'
But the problem with it is that I lose all the state when the exception fires.
Imagine I crawl the site 'www.test.com'. I have managed to download 10 URLs right before the site admins decide to switch the webserver for maintenance. In such a case I lose the information about the crawled pages when the exception fires.
The question is: how do I save the state and process the data even if a Timeout happens?
Why not try something like:
timeout = Timeout(10)

def downloadSite(url):
    with Timeout(10):
        downloadUrl(url)

urls = ["url1", "url2", "url3"]
workers = []
limit = 5
counter = 0
for i in urls:
    # limit to 5 URL requests at a time
    if counter < limit:
        workers.append(gevent.spawn(downloadSite, i))
        counter += 1
    else:
        gevent.joinall(workers)
        # start a new batch with the current URL
        workers = [gevent.spawn(downloadSite, i)]
        counter = 1
gevent.joinall(workers)
You could also save a status in a dict or something for every URL, or append the ones that fail to a separate list to retry later.
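A rough sketch of that bookkeeping, assuming downloadUrl is your existing function that returns the page content or raises on failure:
import gevent
from gevent import Timeout

results = {}   # url -> downloaded content
failed = []    # urls to retry later

def download_site(url):
    try:
        with Timeout(10):
            results[url] = downloadUrl(url)
    except Timeout:
        failed.append(url)

urls = ["url1", "url2", "url3"]
gevent.joinall([gevent.spawn(download_site, u) for u in urls])
print('Downloaded: %d, failed: %d' % (len(results), len(failed)))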
A self-contained example:
import gevent
from gevent import monkey
from gevent import Timeout

gevent.monkey.patch_all()
import urllib2

def get_source(url):
    req = urllib2.Request(url)
    data = None
    # Timeout(2, False) makes the timeout silent: the block is simply
    # abandoned and data stays None instead of an exception being raised.
    with Timeout(2, False):
        response = urllib2.urlopen(req)
        data = response.read()
    return data

N = 10
urls = ['http://google.com' for _ in xrange(N)]
getlets = [gevent.spawn(get_source, url) for url in urls]
gevent.joinall(getlets)
contents = [g.get() for g in getlets]

print contents[5]
It implements one timeout for each request. In this example, contents contains 10 times the HTML source of google.com, each retrieved in an independent request. If one of the requests had timed out, the corresponding element in contents would be None. If you have questions about this code, don't hesitate to ask in the comments.
I saw your last comment. Defining one timeout per request is definitely not wrong from a programming point of view. If you need to throttle traffic to the website, then just don't spawn 100 greenlets simultaneously. Spawn 5 and wait until they have returned. Then you can possibly wait for a given amount of time and spawn the next 5 (already shown in the other answer by Gabriel Samfira, as I see now). For my code above, this would mean that you would have to repeatedly call
N = 10
urls = ['http://google.com' for _ in xrange(N)]
getlets = [gevent.spawn(get_source, url) for url in urls]
gevent.joinall(getlets)
contents = [g.get() for g in getlets]
where N should not be too high.
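A rough sketch of that repeated call, reusing get_source from above (the batch size of 5 and the chunks helper are illustrative choices, not part of the original answer):
def chunks(seq, n):
    # yield successive n-sized batches from seq
    for i in range(0, len(seq), n):
        yield seq[i:i + n]

all_urls = ['http://google.com' for _ in xrange(50)]
contents = []
for batch in chunks(all_urls, 5):
    getlets = [gevent.spawn(get_source, url) for url in batch]
    gevent.joinall(getlets)
    contents.extend(g.get() for g in getlets)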
I have seen this thread already - How can I unshorten a URL?
My issue with the resolved answer (that is, using the unshort.me API) is that I am focusing on unshortening YouTube links. Since unshort.me is heavily used, almost 90% of the results come back with captchas, which I am unable to resolve.
So far I am stuck with using:
def unshorten_url(url):
    resolvedURL = urllib2.urlopen(url)
    print resolvedURL.url

    #t = Test()
    #c = pycurl.Curl()
    #c.setopt(c.URL, 'http://api.unshort.me/?r=%s&t=xml' % (url))
    #c.setopt(c.WRITEFUNCTION, t.body_callback)
    #c.perform()
    #c.close()
    #dom = xml.dom.minidom.parseString(t.contents)
    #resolvedURL = dom.getElementsByTagName("resolvedURL")[0].firstChild.nodeValue

    return resolvedURL.url
Note: everything in the comments is what I tried to do when using the unshort.me service which was returning captcha links.
Does anyone know of a more efficient way to complete this operation without using urlopen to fetch the whole page (since that is a waste of bandwidth)?
A one-line function using the requests library - and yes, it handles chains of shortened URLs (recursion).
import requests

def unshorten_url(url):
    return requests.head(url, allow_redirects=True).url
Use the best rated answer (not the accepted answer) in that question:
# This is for Py2k. For Py3k, use http.client and urllib.parse instead, and
# use // instead of / for the division
import httplib
import urlparse

def unshorten_url(url):
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    resource = parsed.path
    if parsed.query != "":
        resource += "?" + parsed.query
    h.request('HEAD', resource)
    response = h.getresponse()
    if response.status/100 == 3 and response.getheader('Location'):
        return unshorten_url(response.getheader('Location'))  # changed to process chains of short urls
    else:
        return url
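For reference, a sketch of the Python 3 equivalent described in the comment above (http.client, urllib.parse, and // for the division):
import http.client
import urllib.parse

def unshorten_url(url):
    parsed = urllib.parse.urlparse(url)
    h = http.client.HTTPConnection(parsed.netloc)
    resource = parsed.path
    if parsed.query != "":
        resource += "?" + parsed.query
    h.request('HEAD', resource)
    response = h.getresponse()
    if response.status // 100 == 3 and response.getheader('Location'):
        # follow chains of short URLs recursively
        return unshorten_url(response.getheader('Location'))
    return url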
You DO have to open it, otherwise you won't know what URL it will redirect to. As Greg put it:
A short link is a key into somebody else's database; you can't expand the link without querying the database
Now to your question.
Does anyone know of a more efficient way to complete this operation
without using open (since it is a waste of bandwidth)?
The more efficient way is to not close the connection but keep it open in the background, by using HTTP's Connection: keep-alive.
After a small test, unshorten.me seems to take the HEAD method into account and do a redirect to itself:
> telnet unshorten.me 80
Trying 64.202.189.170...
Connected to unshorten.me.
Escape character is '^]'.
HEAD http://unshort.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp HTTP/1.1
Host: unshorten.me
HTTP/1.1 301 Moved Permanently
Date: Mon, 22 Aug 2011 20:42:46 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Location: http://resolves.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp
Cache-Control: private
Content-Length: 0
So if you use the HEAD HTTP method, instead of GET, you will actually end up doing the same work twice.
Instead, you should keep the connection alive, which will save you only a little bandwidth, but what it will certainly save is the latency of establishing a new connection every time. Establishing a TCP/IP connection is expensive.
You should get away with a number of kept-alive connections to the unshorten service equal to the number of concurrent connections your own service receives.
You could manage these connections in a pool. That's the closest you can get, besides tweaking your kernel's TCP/IP stack.
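As a rough sketch of that pooling idea, assuming the requests library is acceptable here (the question itself uses urllib2/pycurl): a requests.Session keeps TCP connections alive in an internal pool and reuses them across calls.
import requests

# The Session's connection pool keeps sockets open (keep-alive), so repeated
# look-ups against the same hosts skip the TCP handshake latency.
session = requests.Session()

def unshorten(url):
    # HEAD with allow_redirects follows the chain without downloading bodies
    return session.head(url, allow_redirects=True).url

print(unshorten('http://bit.ly/cXEInp'))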
Here is source code that takes into account almost all of the useful corner cases:
set a custom Timeout.
set a custom User Agent.
check whether we have to use an http or https connection.
resolve recursively the input url and prevent ending within a loop.
The source code is on GitHub: https://github.com/amirkrifa/UnShortenUrl
Comments are welcome...
import logging
logging.basicConfig(level=logging.DEBUG)

TIMEOUT = 10

class UnShortenUrl:
    def process(self, url, previous_url=None):
        logging.info('Init url: %s' % url)
        import urlparse
        import httplib
        try:
            parsed = urlparse.urlparse(url)
            if parsed.scheme == 'https':
                h = httplib.HTTPSConnection(parsed.netloc, timeout=TIMEOUT)
            else:
                h = httplib.HTTPConnection(parsed.netloc, timeout=TIMEOUT)
            resource = parsed.path
            if parsed.query != "":
                resource += "?" + parsed.query
            try:
                h.request('HEAD',
                          resource,
                          headers={'User-Agent': 'curl/7.38.0'})
                response = h.getresponse()
            except:
                import traceback
                traceback.print_exc()
                return url

            logging.info('Response status: %d' % response.status)
            if response.status/100 == 3 and response.getheader('Location'):
                red_url = response.getheader('Location')
                logging.info('Red, previous: %s, %s' % (red_url, previous_url))
                if red_url == previous_url:
                    return red_url
                return self.process(red_url, previous_url=url)
            else:
                return url
        except:
            import traceback
            traceback.print_exc()
            return None
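A minimal usage sketch (the bit.ly link from the telnet transcript above is used as a stand-in):
if __name__ == '__main__':
    print(UnShortenUrl().process('http://bit.ly/cXEInp'))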
import requests
short_url = "<your short url goes here>"
long_url = requests.get(short_url).url
print(long_url)