I have the following session-dependent code which must be run continuously.
Code
import requests
http = requests.Session()
while True:
# if http is not good, then run http = requests.Session() again
response = http.get(....)
# process respons
# wait for 5 seconds
Note: I moved the line http = requests.Session() out of the loop.
Issue
How to check if the session is working?
An example for a not working session may be after the web server is restarted. Or loadbalancer redirects to a different web server.
The requests.Session object is just a persistence and connection-pooling object to allow shared state between different HTTP request on the client-side.
If the server unexpectedly closes a session, so that it becomes invalid, the server probably would respond with some error-indicating HTTP status code.
Thus requests would raise an error. See Errors and Exceptions:
All exceptions that Requests explicitly raises inherit from requests.exceptions.RequestException.
See the extended classes of RequestException.
Approach 1: implement open/close using try/except
Your code can catch such exceptions within a try/except-block.
It depends on the server's API interface specification how it will signal a invalidated/closed session. This signal response should be evaluated in the except block.
Here we use session_was_closed(exception) function to evaluate the exception/response and Session.close() to close the session correctly before opening a new one.
import requests
# initially open a session object
s = requests.Session()
# execute requests continuously
while True:
try:
response = s.get(....)
# process response
except requests.exceptions.RequestException as e:
if session_was_closed(e):
s.close() # close the session
s = requests.Session() # opens a new session
else:
# process non-session-related errors
# wait for 5 seconds
Depending on the server response of your case, implement the method session_was_closed(exception).
Approach 2: automatically open/close using with
From Advanced Usage, Session Objects:
Sessions can also be used as context managers:
with requests.Session() as s:
s.get('https://httpbin.org/cookies/set/sessioncookie/123456789')
This will make sure the session is closed as soon as the with block is exited, even if unhandled exceptions occurred.
I would flip the logic and add a try-except.
import requests
http = requests.Session()
while True:
try:
response = http.get(....)
except requests.ConnectionException:
http = requests.Session()
continue
# process respons
# wait for 5 seconds
See this answer for more info. I didn't test if the raised exception is that one, so please test it.
Related
I expected calling close() on a Session object to close the session. But looks like that's not happening. Am I missing something?
import requests
s = requests.Session()
url = 'https://google.com'
r = s.get(url)
s.close()
print("s is closed now")
r = s.get(url)
print(r)
output:
s is closed now
<Response [200]>
The second call to s.get() should have given an error.
Inside the implementation for Session.close() we can find that:
def close(self):
"""Closes all adapters and as such the session"""
for v in self.adapters.values():
v.close()
And inside the adapter.close implementation:
def close(self):
"""Disposes of any internal state.
Currently, this closes the PoolManager and any active ProxyManager,
which closes any pooled connections.
"""
self.poolmanager.clear()
for proxy in self.proxy_manager.values():
proxy.clear()
So what I could make out is that, it clears the state of the Session object. So in case you have logged in to some site and have some stored cookies in the Session, then these cookies will be removed once you use the session.close() method. The inner functions still remain functional though.
You can use a Context Manager to auto-close it:
import requests
with requests.Session() as s:
url = 'https://google.com'
r = s.get(url)
See requests docs > Sessions:
This will make sure the session is closed as soon as the with block is exited, even if unhandled exceptions occurred.
I've been developing an application, where I need to handle temporarily disconnects on the client (network interface goes down).
I initially thought the below approach would work, but sometimes if restart the network interface, the s.get(url) call would hang indefinitely:
s = requests.Session()
s.mount('http://stackoverflow.com', HTTPAdapter(max_retries=Retry(total=10, connect=10, read=10)))
s.get(url)
By adding the timeout=10 keyword argument to s.get(url), the code is now able to handle this blocking behavior:
s = requests.Session()
s.mount('http://stackoverflow.com', HTTPAdapter(max_retries=Retry(total=10, connect=10, read=10)))
s.get(url, timeout=10)
Why is a timeout necessary to handle the cases, where a network interface resets or goes down temporarily? Why is max_retries=Retry(total=10, connect=10, read=10) not able to handle this? In particular, why is s.get() not informed that the network interface went offline, so that it could retry the connection instead of hanging?
Try:
https://urllib3.readthedocs.io/en/latest/reference/urllib3.util.html#urllib3.util.retry.Retry
from requests.adapters import HTTPAdapter
s = requests.Session()
s.mount('http://stackoverflow.com', HTTPAdapter(max_retries=5))
Or:
retries = Retry(connect=5, read=2, redirect=5)
http = PoolManager(retries=retries)
response = http.request('GET', 'http://stackoverflow.com')
Or:
response = http.request('GET', 'http://stackoverflow.com', retries=Retry(10))
What's the correct way to use HTTPAdapter with Async programming and calling out to a method? All of these requests are being made to the same domain.
I'm doing some async programming in Celery using eventlet and testing the load on one of my sites. I have a method that I call out to which makes the request to the url.
def get_session(url):
# gets session returns source
headers, proxies = header_proxy()
# set all of our necessary variables to None so that in the event of an error
# we can make sure we dont break
response = None
status_code = None
out_data = None
content = None
try:
# we are going to use request-html to be able to parse the
# data upon the initial request
with HTMLSession() as session:
# you can swap out the original request session here
# session = requests.session()
# passing the parameters to the session
session.mount('https://', HTTPAdapter(max_retries=0, pool_connections=250, pool_maxsize=500))
response = session.get(url, headers=headers, proxies=proxies)
status_code = response.status_code
try:
# we are checking to see if we are getting a 403 error on all requests. If so,
# we update the status code
code = response.html.xpath('''//*[#id="accessDenied"]/p[1]/b/text()''')
if code:
status_code = str(code[0][:-1])
else:
pass
except Exception as error:
pass
# print(error)
# assign the content to content
content = response.content
except Exception as error:
print(error)
pass
If I leave out the pool_connections and pool_maxsize parameters, and run the code, I get an error indicating that I do not have enough open connections. However, I don't want to unnecessarily open up a large number of connections if I dont need to.
based on this... https://laike9m.com/blog/requests-secret-pool_connections-and-pool_maxsize,89/ Im going to guess that this applies to the host and not so much the async task. Therefore, I set the max number to the max number of connections that can be reused per host. If I hit a domain several times, the connection is reused.
Currently taking a web scraping class with other students, and we are supposed to make ‘get’ requests to a dummy site, parse it, and visit another site.
The problem is, the content of the dummy site is only up for several minutes and disappears, and the content comes back up at a certain interval. During the time the content is available, everyone tries to make the ‘get’ requests, so mine just hangs until everyone clears up, and the content eventually disappears. So I end up not being able to successfully make the ‘get’ request:
import requests
from splinter import Browser
browser = Browser('chrome')
# Hangs here
requests.get('http://dummysite.ca').text
# Even if get is successful hangs here as well
browser.visit(parsed_url)
So my question is, what's the fastest/best way to make endless concurrent 'get' requests until I get a response?
Decide to use either requests or splinter
Read about Requests: HTTP for Humans
Read about Splinter
Related
Read about keep-alive
Read about blocking-or-non-blocking
Read about timeouts
Read about errors-and-exceptions
If you are able to get not hanging requests, you can think of repeated requests, for instance:
while True:
requests.get(...
if request is succesfull:
break
time.sleep(1)
Gevent provides a framework for running asynchronous network requests.
It can patch Python's standard library so that existing libraries like requests and splinter work out of the box.
Here is a short example of how to make 10 concurrent requests, based on the above code, and get their response.
from gevent import monkey
monkey.patch_all()
import gevent.pool
import requests
pool = gevent.pool.Pool(size=10)
greenlets = [pool.spawn(requests.get, 'http://dummysite.ca')
for _ in range(10)]
# Wait for all requests to complete
pool.join()
for greenlet in greenlets:
# This will raise any exceptions raised by the request
# Need to catch errors, or check if an exception was
# thrown by checking `greenlet.exception`
response = greenlet.get()
text_response = response.text
Could also use map and a response handling function instead of get.
See gevent documentation for more information.
In this situation, concurrency will not help much since the server seems to be the limiting factor. One solution is to send a request with a timeout interval, if the interval has exceeded, then try the request again after a few seconds. Then gradually increase the time between retries until you get the data that you want. For instance, your code might look like this:
import time
import requests
def get_content(url, timeout):
# raise Timeout exception if more than x sends have passed
resp = requests.get(url, timeout=timeout)
# raise generic exception if request is unsuccessful
if resp.status_code != 200:
raise LookupError('status is not 200')
return resp.content
timeout = 5 # seconds
retry_interval = 0
max_retry_interval = 120
while True:
try:
response = get_content('https://example.com', timeout=timeout)
retry_interval = 0 # reset retry interval after success
break
except (LookupError, requests.exceptions.Timeout):
retry_interval += 10
if retry_interval > max_retry_interval:
retry_interval = max_retry_interval
time.sleep(retry_interval)
# process response
If concurrency is required, consider the Scrapy project. It uses the Twisted framework. In Scrapy you can replace time.sleep with reactor.callLater(fn, *args, **kw) or use one of hundreds of middleware plugins.
From the documentation for requests:
If the remote server is very slow, you can tell Requests to wait
forever for a response, by passing None as a timeout value and then
retrieving a cup of coffee.
import requests
#Wait potentially forever
r = requests.get('http://dummysite.ca', timeout=None)
#Check the status code to see how the server is handling the request
print r.status_code
Status codes beginning with 2 mean the request was received, understood, and accepted. 200 means the request was a success and the information returned. But 503 means the server is overloaded or undergoing maintenance.
Requests used to include a module called async which could send concurrent requests. It is now an independent module named grequests
which you can use to make concurrent requests endlessly until a 200 response:
import grequests
urls = [
'http://python-requests.org', #Just include one url if you want
'http://httpbin.org',
'http://python-guide.org',
'http://kennethreitz.com'
]
def keep_going():
rs = (grequests.get(u) for u in urls) #Make a set of unsent Requests
out = grequests.map(rs) #Send them all at the same time
for i in out:
if i.status_code == 200:
print i.text
del urls[out.index(i)] #If we have the content, delete the URL
return
while urls:
keep_going()
Using the HTTPretty library for Python, I can create mock HTTP responses of choice and then pick them up i.e. with the requests library like so:
import httpretty
import requests
# set up a mock
httpretty.enable()
httpretty.register_uri(
method=httpretty.GET,
uri='http://www.fakeurl.com',
status=200,
body='My Response Body'
)
response = requests.get('http://www.fakeurl.com')
# clean up
httpretty.disable()
httpretty.reset()
print(response)
Out: <Response [200]>
Is there also the possibility to register an uri which cannot be reached (e.g. connection timed out, connection refused, ...) such that no response is received at all (which is not the same as an established connection which gives an HTTP error code like 404)?
I want to use this behaviour in unit testing to ensure that my error handling works as expected (which does different things in case of 'no connection established' and 'connection established, bad bad HTTP status code'). As a workaround, I could try to connect to an invalid server like http://192.0.2.0 which would time out in any case. However, I would prefer to do all my unit testing without using any real network connections.
Meanwhile I got it, using a HTTPretty callback body seems to produce the desired behaviour. See inline comments below.
This is actually not exactly the same as I was looking for (it is not a server that cannot be reached and hence the request times out but a server that throws a timeout exception once it is reached, however, the effect is the same for my usecase.
Still, if anybody knows a different solution, I'm looking forward to it.
import httpretty
import requests
# enable HTTPretty
httpretty.enable()
# create a callback body that raises an exception when opened
def exceptionCallback(request, uri, headers):
# raise your favourite exception here, e.g. requests.ConnectionError or requests.Timeout
raise requests.Timeout('Connection timed out.')
# set up a mock and use the callback function as response's body
httpretty.register_uri(
method=httpretty.GET,
uri='http://www.fakeurl.com',
status=200,
body=exceptionCallback
)
# try to get a response from the mock server and catch the exception
try:
response = requests.get('http://www.fakeurl.com')
except requests.Timeout as e:
print('requests.Timeout exception got caught...')
print(e)
# do whatever...
# clean up
httpretty.disable()
httpretty.reset()