socket ResourceWarning using urllib in Python 3 - python

I am using a urllib.request.urlopen() to GET from a web service I'm trying to test.
This returns an HTTPResponse object, which I then read() to get the response body.
But I always see a ResourceWarning about an unclosed socket from socket.py
Here's the relevant function:
from urllib.request import Request, urlopen
def get_from_webservice(url):
""" GET from the webservice """
req = Request(url, method="GET", headers=HEADERS)
with urlopen(req) as rsp:
body = rsp.read().decode('utf-8')
return json.loads(body)
Here's the warning as it appears in the program's output:
$ ./test/test_webservices.py
/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/socket.py:359: ResourceWarning: unclosed <socket.socket object, fd=5, family=30, type=1, proto=6>
self._sock = None
.s
----------------------------------------------------------------------
Ran 2 tests in 0.010s
OK (skipped=1)
If there's anything I can do to the HTTPResponse (or the Request?) to make it close its socket cleanly,
I would really like to know, because this code is for my unit tests; I don't like
ignoring warnings anywhere, but especially not there.

I don't know if this is the answer, but it is part of the way to an answer.
If I add the header "connection: close" to the response from my web services, the HTTPResponse object seems to clean itself up properly without a warning.
And in fact, the HTTP Spec (http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html) says:
HTTP/1.1 applications that do not support persistent connections MUST include the "close" connection option in every message.
So the problem was on the server end (i.e. my fault!). In the event that you don't have control over the headers coming from the server, I don't know what you can do.

I had the same problem with urllib3 and I just added a context manager to close connection automatically:
import urllib3
def get(addr, headers):
""" this function will close the connection after a http request. """
with urllib3.PoolManager() as conn:
res = conn.request('GET', addr, headers=headers)
if r.status == 200:
return res.data
else:
raise ConnectionError(res.reason)
Note that urllib3 is designed to have a pool of connections and to keep connections alive for you. This can significantly speed up your application, if it needs to make a series of requests, e.g. few calls to the backend API.
Please read urllib3 documentation re connection pools here: https://urllib3.readthedocs.io/en/1.5/pools.html
P.S. you could also use requests lib, which is not a part of the Python standard lib (at 2019) but is very powerful and simple to use: http://docs.python-requests.org/en/master/

Related

Using session context manager and stream=True at the same time

I was wondering if this code is safe:
import requests
with requests.Session() as s:
response = s.get("http://google.com", stream=True)
content = response.content
For example simple like this, this does not fail (note I don't write "it works" :p), since the pool does not close instantly the connection anyway (that's a point of a session/pool right?).
Using stream=True, the response object is supposed to have its raw attribute that contains the connection, but I'm unsure if the connection is owned by the session or not, and then if at some point if I don't read the content right now but later, it might have been closed.
My current 2 cents is that it's unsafe, but I'm not 100% sure.
Thanks!
[Edit after reading requests code]
After reading the requests code in more details, it seems that it's what requests.get is doing itself:
https://github.com/requests/requests/blob/master/requests/api.py
def request(method, url, **kwargs):
with sessions.Session() as session:
return session.request(method=method, url=url, **kwargs)
Being that kwargs might contain stream.
So I guess the answer is "it's safe".
Ok, I got my answer by digging in requests and urllib3
The Response object own the connection until the close() method is explicitly called:
https://github.com/requests/requests/blob/24092b11d74af0a766d9cc616622f38adb0044b9/requests/models.py#L937-L948
def close(self):
"""Releases the connection back to the pool. Once this method has been
called the underlying ``raw`` object must not be accessed again.
*Note: Should not normally need to be called explicitly.*
"""
if not self._content_consumed:
self.raw.close()
release_conn = getattr(self.raw, 'release_conn', None)
if release_conn is not None:
release_conn()
release_conn() is a urllib3 method, this releases the connection and put it back in the pool
def release_conn(self):
if not self._pool or not self._connection:
return
self._pool._put_conn(self._connection)
self._connection = None
If the Session was destroyed (like using the with), the pool was destroyed, the connection cannot be put back and it's just closed.
TL;DR; This is safe.
Note that this also means that in stream mode, the connection is not available for the pool until the response is closed explicitly or the content read entirely.

Python + requests + splinter: What's the fastest/best way to make multiple concurrent 'get' requests?

Currently taking a web scraping class with other students, and we are supposed to make ‘get’ requests to a dummy site, parse it, and visit another site.
The problem is, the content of the dummy site is only up for several minutes and disappears, and the content comes back up at a certain interval. During the time the content is available, everyone tries to make the ‘get’ requests, so mine just hangs until everyone clears up, and the content eventually disappears. So I end up not being able to successfully make the ‘get’ request:
import requests
from splinter import Browser
browser = Browser('chrome')
# Hangs here
requests.get('http://dummysite.ca').text
# Even if get is successful hangs here as well
browser.visit(parsed_url)
So my question is, what's the fastest/best way to make endless concurrent 'get' requests until I get a response?
Decide to use either requests or splinter
Read about Requests: HTTP for Humans
Read about Splinter
Related
Read about keep-alive
Read about blocking-or-non-blocking
Read about timeouts
Read about errors-and-exceptions
If you are able to get not hanging requests, you can think of repeated requests, for instance:
while True:
requests.get(...
if request is succesfull:
break
time.sleep(1)
Gevent provides a framework for running asynchronous network requests.
It can patch Python's standard library so that existing libraries like requests and splinter work out of the box.
Here is a short example of how to make 10 concurrent requests, based on the above code, and get their response.
from gevent import monkey
monkey.patch_all()
import gevent.pool
import requests
pool = gevent.pool.Pool(size=10)
greenlets = [pool.spawn(requests.get, 'http://dummysite.ca')
for _ in range(10)]
# Wait for all requests to complete
pool.join()
for greenlet in greenlets:
# This will raise any exceptions raised by the request
# Need to catch errors, or check if an exception was
# thrown by checking `greenlet.exception`
response = greenlet.get()
text_response = response.text
Could also use map and a response handling function instead of get.
See gevent documentation for more information.
In this situation, concurrency will not help much since the server seems to be the limiting factor. One solution is to send a request with a timeout interval, if the interval has exceeded, then try the request again after a few seconds. Then gradually increase the time between retries until you get the data that you want. For instance, your code might look like this:
import time
import requests
def get_content(url, timeout):
# raise Timeout exception if more than x sends have passed
resp = requests.get(url, timeout=timeout)
# raise generic exception if request is unsuccessful
if resp.status_code != 200:
raise LookupError('status is not 200')
return resp.content
timeout = 5 # seconds
retry_interval = 0
max_retry_interval = 120
while True:
try:
response = get_content('https://example.com', timeout=timeout)
retry_interval = 0 # reset retry interval after success
break
except (LookupError, requests.exceptions.Timeout):
retry_interval += 10
if retry_interval > max_retry_interval:
retry_interval = max_retry_interval
time.sleep(retry_interval)
# process response
If concurrency is required, consider the Scrapy project. It uses the Twisted framework. In Scrapy you can replace time.sleep with reactor.callLater(fn, *args, **kw) or use one of hundreds of middleware plugins.
From the documentation for requests:
If the remote server is very slow, you can tell Requests to wait
forever for a response, by passing None as a timeout value and then
retrieving a cup of coffee.
import requests
#Wait potentially forever
r = requests.get('http://dummysite.ca', timeout=None)
#Check the status code to see how the server is handling the request
print r.status_code
Status codes beginning with 2 mean the request was received, understood, and accepted. 200 means the request was a success and the information returned. But 503 means the server is overloaded or undergoing maintenance.
Requests used to include a module called async which could send concurrent requests. It is now an independent module named grequests
which you can use to make concurrent requests endlessly until a 200 response:
import grequests
urls = [
'http://python-requests.org', #Just include one url if you want
'http://httpbin.org',
'http://python-guide.org',
'http://kennethreitz.com'
]
def keep_going():
rs = (grequests.get(u) for u in urls) #Make a set of unsent Requests
out = grequests.map(rs) #Send them all at the same time
for i in out:
if i.status_code == 200:
print i.text
del urls[out.index(i)] #If we have the content, delete the URL
return
while urls:
keep_going()

Google API client (Python): is it possible to use BatchHttpRequest with ETag caching

I'm using YouTube data API v3.
Is it possible to make a big BatchHttpRequest (e.g., see here) and also to use ETags for local caching at the httplib2 level (e.g., see here)?
ETags work fine for single queries, I don't understand if they are useful also for batch requests.
TL;DR:
BatchHttpRequest cannot be used with caching
HERE IT IS:
First lets see the way to initialize BatchHttpRequest:
from apiclient.http import BatchHttpRequest
def list_animals(request_id, response, exception):
if exception is not None:
# Do something with the exception
pass
else:
# Do something with the response
pass
def list_farmers(request_id, response):
"""Do something with the farmers list response."""
pass
service = build('farm', 'v2')
batch = service.new_batch_http_request()
batch.add(service.animals().list(), callback=list_animals)
batch.add(service.farmers().list(), callback=list_farmers)
batch.execute(http=http)
Second lets see how ETags are used:
from google.appengine.api import memcache
http = httplib2.Http(cache=memcache)
Now lets analyze:
Observe the last line of BatchHttpRequest example: batch.execute(http=http), and now checking the source code for execute, it calls _refresh_and_apply_credentials, which applies the http object we pass it.
def _refresh_and_apply_credentials(self, request, http):
"""Refresh the credentials and apply to the request.
Args:
request: HttpRequest, the request.
http: httplib2.Http, the global http object for the batch.
"""
# For the credentials to refresh, but only once per refresh_token
# If there is no http per the request then refresh the http passed in
# via execute()
Which means, execute call which takes in http, can be passed the ETag http you would have created as:
http = httplib2.Http(cache=memcache)
# This would mean we would get the ETags cached http
batch.execute(http=http)
Update 1:
Could try with a custom object as well:
from googleapiclient.discovery_cache import DISCOVERY_DOC_MAX_AGE
from googleapiclient.discovery_cache.base import Cache
from googleapiclient.discovery_cache.file_cache import Cache as FileCache
custCache = FileCache(max_age=DISCOVERY_DOC_MAX_AGE)
http = httplib2.Http(cache=custCache)
# This would mean we would get the ETags cached http
batch.execute(http=http)
Because, this is just a hunch on the comment in http2 lib:
"""If 'cache' is a string then it is used as a directory name for
a disk cache. Otherwise it must be an object that supports the
same interface as FileCache.
Conclusion Update 2:
After again verifying the google-api-python source code, I see that, BatchHttpRequest is fixed with 'POST' request and has a content-type of multipart/mixed;.. - source code.
Giving a clue about the fact that, BatchHttpRequest is useful in order to POST data which is then processed down the later.
Now, keeping that in mind, observing what httplib2 request method uses: _updateCache only when following criteria are met:
Request is in ["GET", "HEAD"] or response.status == 303 or is a redirect request
ElSE -- response.status in [200, 203] and method in ["GET", "HEAD"]
OR -- if response.status == 304 and method == "GET"
This means, BatchHttpRequest cannot be used with caching.

Mock a HTTP request that times out with HTTPretty

Using the HTTPretty library for Python, I can create mock HTTP responses of choice and then pick them up i.e. with the requests library like so:
import httpretty
import requests
# set up a mock
httpretty.enable()
httpretty.register_uri(
method=httpretty.GET,
uri='http://www.fakeurl.com',
status=200,
body='My Response Body'
)
response = requests.get('http://www.fakeurl.com')
# clean up
httpretty.disable()
httpretty.reset()
print(response)
Out: <Response [200]>
Is there also the possibility to register an uri which cannot be reached (e.g. connection timed out, connection refused, ...) such that no response is received at all (which is not the same as an established connection which gives an HTTP error code like 404)?
I want to use this behaviour in unit testing to ensure that my error handling works as expected (which does different things in case of 'no connection established' and 'connection established, bad bad HTTP status code'). As a workaround, I could try to connect to an invalid server like http://192.0.2.0 which would time out in any case. However, I would prefer to do all my unit testing without using any real network connections.
Meanwhile I got it, using a HTTPretty callback body seems to produce the desired behaviour. See inline comments below.
This is actually not exactly the same as I was looking for (it is not a server that cannot be reached and hence the request times out but a server that throws a timeout exception once it is reached, however, the effect is the same for my usecase.
Still, if anybody knows a different solution, I'm looking forward to it.
import httpretty
import requests
# enable HTTPretty
httpretty.enable()
# create a callback body that raises an exception when opened
def exceptionCallback(request, uri, headers):
# raise your favourite exception here, e.g. requests.ConnectionError or requests.Timeout
raise requests.Timeout('Connection timed out.')
# set up a mock and use the callback function as response's body
httpretty.register_uri(
method=httpretty.GET,
uri='http://www.fakeurl.com',
status=200,
body=exceptionCallback
)
# try to get a response from the mock server and catch the exception
try:
response = requests.get('http://www.fakeurl.com')
except requests.Timeout as e:
print('requests.Timeout exception got caught...')
print(e)
# do whatever...
# clean up
httpretty.disable()
httpretty.reset()

HTTPConnection to make DELETE request: 505 response

Frustratingly I'm needing to develop something on Python 2.6.4, and need to send a delete request to a server that seems to only support http 1.1. Here is my code:
httpConnection = httplib.HTTPConnection("localhost:9080")
httpConnection.request('DELETE', remainderURL)
httpResponse = httpConnection.getresponse()
The response code I then get is: 505 (HTTP version not supported)
I've tested sending a delete request via Firefox's RESTClient to the same URL and that works.
I can't use urllib2 because it doesn't support the DELETE request. Is the HTTPConnection object http 1.0 only? Or am I doing something wrong?
The HTTPConnection class uses HTTP/1.1 throughout, and the 505 seems to indicate it's the server that cannot handle HTTP/1.1 requests.
However, if you need to make DELETE requests, why not use the Requests package instead? A DELETE is as simple as:
import requests
requests.delete(url)
That won't magically solve your HTTP version mismatch, but you can enable verbose logging to figure out what is going on:
import sys
requests.delete(url, config=dict(verbose=sys.stderr))
You can use urllib2:
req = urllib2.Request(query_url)
req.get_method = lambda: 'DELETE' # creates the delete method
url = urllib2.urlopen(req)
httplib uses HTTP/1.1 (see HTTPConnection.putRequest method documentation).
Check httpResponse.version to see what version the server is using.

Categories

Resources