Python Requests, warning: urllib3.connectionpool: Connection pool is full

I'm using the requests library in Python 3 and, despite my best efforts, I can't get the following warning to disappear:
WARNING:requests.packages.urllib3.connectionpool:Connection pool is full, discarding connection: myorganization.zendesk.com
I'm using requests in a multithreaded environment to GET and POST JSON files concurrently to a single host (definitely no subdomains). In the current setup I'm using just 20 threads.
I attempted to use a Session in order to get requests to reuse connections and thus get rid of the problem, but it hasn't worked. This is the code in my class constructor:
self.session = requests.Session()
adapter = requests.adapters.HTTPAdapter(
    pool_connections=100, pool_maxsize=100)
self.session.mount('http://', adapter)
self.session.headers.update({'Connection':'Keep-Alive'})
self.session.auth = (self._user+"/token", self._token)
According to advice from here, I shouldn't need to increase the pool size by that much given the number of threads I'm using, but I still get this warning even after raising it to 100.
This makes me think that connections are not being reused at all, or, if they are, that too many are being created for some reason. I've updated requests, so it is the most up-to-date version.
Does anyone have any ideas how I can get rid of this? I'm debugging some code and I think this is to blame for some requests not being made correctly.
Related:
Can I change the connection pool size for Python's "requests" module?

Since Zendesk communicates over HTTPS, you just need to mount the adapter on the https protocol, i.e.
self.session.mount('https://', adapter)
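Applied to the constructor in the question, only the mount line changes (a minimal sketch; the headers and auth lines stay as they were):
self.session = requests.Session()
adapter = requests.adapters.HTTPAdapter(
    pool_connections=100, pool_maxsize=100)
self.session.mount('https://', adapter)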


Python HTTP client with request pipelining

The problem: I need to send many HTTP requests to a server. I can only use one connection (a non-negotiable server limit). The server's response time plus the network latency is too high; I'm falling behind.
The requests typically don't change server state and don't depend on the previous request's response. So my idea is to simply send them on top of each other, enqueue the response objects, and rely on the Content-Length: headers of the incoming responses to feed the incoming replies to the next-waiting response object. In other words: pipeline the requests to the server.
This is of course not entirely safe (any reply without a Content-Length: means trouble), but I don't care; in that case I can always retry any queued requests. (The safe way would be to wait for the header before sending the next bit. That might help me enough. There's no way to test beforehand.)
So, ideally I want the following client code (which uses client delays to mimic network latency) to run in three seconds.
Now for the $64,000 question: is there a Python library which already does this, or do I need to roll my own? My code uses gevent; I could use Twisted if necessary, but Twisted's standard connection pool does not support pipelined requests. I could also write a wrapper for some C library if necessary, but I'd prefer native code.
#!/usr/bin/python
import gevent.pool
from gevent import sleep
from time import time
from geventhttpclient import HTTPClient

url = 'http://local_server/100k_of_lorem_ipsum.txt'
http = HTTPClient.from_url(url, concurrency=1)

def get_it(http):
    print time(), "Queueing request"
    response = http.get(url)
    print time(), "Expect header data"
    # Do something with the header, just to make sure that it has arrived
    # (the greenlet should block until then)
    assert response.status_code == 200
    assert response["content-length"] > 0
    for h in response.items():
        pass
    print time(), "Wait before reading body data"
    # Now I can read the body. The library should send at
    # least one new HTTP request during this time.
    sleep(2)
    print time(), "Reading body data"
    while response.read(10000):
        pass
    print time(), "Processing my response"
    # The next request should definitely be transmitted NOW.
    sleep(1)
    print time(), "Done"

# Run parallel requests
pool = gevent.pool.Pool(3)
for i in range(3):
    pool.spawn(get_it, http)
pool.join()
http.close()
Dugong is an HTTP/1.1-only client which claims to support real HTTP/1.1 pipelining. The tutorial includes several examples of how to use it, including one using threads and another using asyncio.
Be sure to verify that the server you're communicating with actually supports HTTP/1.1 pipelining; some servers claim to support HTTP/1.1 but don't implement pipelining.
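A rough sketch of what that can look like with Dugong's HTTPConnection (hedged: adapted from the pattern in its tutorial; the host and paths are placeholders):
from dugong import HTTPConnection

conn = HTTPConnection('example.com')  # placeholder host
paths = ['/file1', '/file2', '/file3']

# Queue several requests on the same connection before reading any response.
# For many or large requests you must interleave sending and reading (the
# tutorial uses a thread or coroutine for this) or the send side may block.
for path in paths:
    conn.send_request('GET', path)

for path in paths:
    resp = conn.read_response()
    body = conn.readall()
    print(path, resp.status, len(body))

conn.disconnect()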
I think txrequests could get you most of what you are looking for, using its background_callback to enqueue processing of responses on a separate thread. Each request would still be its own thread, but using a session means that by default it would reuse the same connection.
https://github.com/tardyp/txrequests#working-in-the-background
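For illustration, a rough sketch in that direction (hedged: based on the README linked above; the URL is a placeholder, and a Twisted reactor must be running for the Deferred to fire):
from twisted.internet import reactor
from txrequests import Session

session = Session()

def bg_cb(sess, resp):
    # runs in the session's worker thread, off the reactor thread
    resp.data = resp.json()
    return resp

def on_done(resp):
    print('status %s, parsed body: %r' % (resp.status_code, resp.data))
    session.close()
    reactor.stop()

d = session.get('http://httpbin.org/get', background_callback=bg_cb)
d.addCallback(on_done)
reactor.run()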
It seems you are running Python 2.
For Python 3 >= 3.5 you could use async/await; see asyncio.
There is also a third-party library called Trio, available on pip, which aims to make async code easier to write.
Another thing I can think of is multiple threads with locks, though I'm not sure how well that would work here.
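For illustration only, a minimal asyncio sketch using aiohttp (my choice of library, not named in the answer; note this gives concurrent requests over multiple connections, not true pipelining on a single connection):
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.read()

async def main():
    # placeholder URL, reused three times as in the question's example
    urls = ['http://local_server/100k_of_lorem_ipsum.txt'] * 3
    async with aiohttp.ClientSession() as session:
        bodies = await asyncio.gather(*(fetch(session, u) for u in urls))
    print([len(b) for b in bodies])

loop = asyncio.get_event_loop()
loop.run_until_complete(main())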

Change the connection pool size for Python's "requests" module when in Threading

(Edit: perhaps I am wrong about what this error means. Is it indicating that the connection pool at my CLIENT is full? Or that a connection pool at the SERVER is full and this is the error my client is being given?)
I am attempting to make a large number of HTTP requests concurrently using the Python threading and requests modules. I am seeing this error in logs:
WARNING:requests.packages.urllib3.connectionpool:HttpConnectionPool is full, discarding connection:
What can I do to increase the size of the connection pool for requests?
This should do the trick:
import requests
import requests.adapters

session = requests.Session()
adapter = requests.adapters.HTTPAdapter(pool_connections=100, pool_maxsize=100)
session.mount('http://', adapter)
response = session.get("http://example.com/mypage")  # an absolute URL is required; the host is a placeholder
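For the threaded scenario in the question, a sketch of how the shared session is typically used (host and paths are placeholders):
import threading
import requests
import requests.adapters

session = requests.Session()
adapter = requests.adapters.HTTPAdapter(pool_connections=100, pool_maxsize=100)
session.mount('http://', adapter)
session.mount('https://', adapter)

def fetch(path):
    # all workers share the session, so connections come from one pool
    session.get('http://example.com' + path)  # placeholder host

threads = [threading.Thread(target=fetch, args=('/page/%d' % i,)) for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()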
Note: Use this solution only if you cannot control the construction of the connection pool (as described in @Jahaja's answer).
The problem is that urllib3 creates the pools on demand. It calls the constructor of the urllib3.connectionpool.HTTPConnectionPool class without parameters. The classes are registered in urllib3.poolmanager.pool_classes_by_scheme. The trick is to replace these classes with subclasses that have different default parameters:
def patch_http_connection_pool(**constructor_kwargs):
    """
    This allows overriding the default parameters of the
    HTTPConnectionPool constructor.
    For example, to increase the pool size to fix problems
    with "HttpConnectionPool is full, discarding connection",
    call this function with maxsize=16 (or whatever size
    you want to give to the connection pool).
    """
    from urllib3 import connectionpool, poolmanager

    class MyHTTPConnectionPool(connectionpool.HTTPConnectionPool):
        def __init__(self, *args, **kwargs):
            kwargs.update(constructor_kwargs)
            super(MyHTTPConnectionPool, self).__init__(*args, **kwargs)

    poolmanager.pool_classes_by_scheme['http'] = MyHTTPConnectionPool
Then you can call it to set new default parameters. Make sure this is called before any connection is made.
patch_http_connection_pool(maxsize=16)
If you use https connections you can create a similar function:
def patch_https_connection_pool(**constructor_kwargs):
    """
    This allows overriding the default parameters of the
    HTTPSConnectionPool constructor.
    For example, to increase the pool size to fix problems
    with "HTTPSConnectionPool is full, discarding connection",
    call this function with maxsize=16 (or whatever size
    you want to give to the connection pool).
    """
    from urllib3 import connectionpool, poolmanager

    class MyHTTPSConnectionPool(connectionpool.HTTPSConnectionPool):
        def __init__(self, *args, **kwargs):
            kwargs.update(constructor_kwargs)
            super(MyHTTPSConnectionPool, self).__init__(*args, **kwargs)

    poolmanager.pool_classes_by_scheme['https'] = MyHTTPSConnectionPool
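As with the HTTP version, call it before any connection is made:
patch_https_connection_pool(maxsize=16)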
Jahaja's answer already gives the recommended solution to your problem, but it does not answer what is going on or, as you asked, what this error means.
Some very detailed information about this is in the official urllib3 documentation, the package that requests uses under the hood to actually perform its requests. Here are the relevant parts for your question, with a few notes of my own added and code examples omitted, since requests has a different API:
The PoolManager class automatically handles creating ConnectionPool instances for each host as needed. By default, it will keep a maximum of 10 ConnectionPool instances [Note: That's pool_connections in requests.adapters.HTTPAdapter(), and it has the same default value of 10]. If you’re making requests to many different hosts it might improve performance to increase this number
However, keep in mind that this does increase memory and socket consumption.
Similarly, the ConnectionPool class keeps a pool of individual HTTPConnection instances. These connections are used during an individual request and returned to the pool when the request is complete. By default only one connection will be saved for re-use [Note: That's pool_maxsize in HTTPAdapter(), and requests changes the default value from 1 to 10]. If you are making many requests to the same host simultaneously it might improve performance to increase this number
The behavior of the pooling for ConnectionPool is different from PoolManager. By default, if a new request is made and there is no free connection in the pool then a new connection will be created. However, this connection will not be saved if more than maxsize connections exist. This means that maxsize does not determine the maximum number of connections that can be open to a particular host, just the maximum number of connections to keep in the pool. However, if you specify block=True [Note: Available as pool_block in HTTPAdapter()] then there can be at most maxsize connections open to a particular host
Given that, here's what happened in your case:
All pools mentioned are CLIENT pools. You (or requests) have no control over any server connection pools.
That warning is about HttpConnectionPool, i.e., the number of simultaneous connections made to the same host, so you can increase pool_maxsize to match the number of workers/threads you're using in order to get rid of the warning.
Note that requests is already opening as many simultaneous connections as you ask for, regardless of pool_maxsize. If you have 100 threads, it will open 100 connections. But with the default value only 10 of them will be kept in the pool for later reuse, and 90 will be discarded after completing the request.
Thus, a larger pool_maxsize increases performance to a single host by reusing connections, not by increasing concurrency.
If you're dealing with multiple hosts, then you might change pool_connections instead. The default is already 10, so if all your requests are to the same target host, increasing it will have no effect on performance (but it will increase the resources used, as the documentation above says).
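For completeness, a small sketch of the block=True behaviour described above (the URL is a placeholder; with pool_block=True at most pool_maxsize connections will ever be open to a host, and extra threads wait for a free one):
import requests
import requests.adapters

session = requests.Session()
adapter = requests.adapters.HTTPAdapter(
    pool_connections=10,
    pool_maxsize=20,    # match your number of worker threads
    pool_block=True)    # cap open connections per host at pool_maxsize
session.mount('https://', adapter)
response = session.get('https://example.com/api')  # placeholder URL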
In case anyone needs to do this with Python Zeep and wants to save a bit of time figuring it out,
here is a quick recipe:
from zeep import Client
from requests import adapters as request_adapters
soap = "http://example.com/BLA/sdwl.wsdl"
wsdl_path = "http://example.com/PATH/TO_WSLD?wsdl"
bind = "Binding"
client = Client(wsdl_path) # Create Client
# switch adapter
session = client.transport.session
adapter = request_adapters.HTTPAdapter(pool_connections=10, pool_maxsize=10)
# mount adapter
session.mount('https://', adapter)
binding = '{%s}%s' % (soap, bind)
# Create Service
service = client.create_service(binding, wsdl_path.split('?')[0])
Basically, the connection pool should be configured before creating the service.
The answer is actually taken from a closed issue in the python-zeep repo;
for reference I'll add it --> here

How to properly forward requests through proxies with MITMProxy?

Trying to use MITMProxy to do custom forwarding of requests made from the Firefox browser, so that they go through one of several proxies selected at runtime. It is performing too slowly for our purposes. Please bear in mind we are running this on Python 2.7.
The process is as follows:
Firefox sends request to configured MITMProxy.
MITMProxy takes the request from Firefox, generates an equivalent requests request, and gets the data from the target server through a given proxy (which is not controlled by us and requires authentication).
The response from the proxy-forwarded request gets converted into a response for the browser.
MITMProxy returns the data to the browser.
The situation seems to be that this process is too slow, which I believe could be for a number of reasons: settings that degrade performance (too much logging, for example), a procedure that is not the right one for the job (totally plausible), or something completely different.
How can we make this run faster?
Thanks very much! Any and all suggestions will be appreciated!
In this particular case, we were using the script feature of MITMProxy, which meant every modified request was executed synchronously (i.e., we could not use proper asynchronous behavior). This naturally became an issue once we started using the scripts with more clients.
As @Puciek mentioned in his comment, this was more a design issue than a problem with the library.

Python urllib2: Cannot assign requested address

I am sending thousands of requests using urllib2 with proxies. I have received many of the following error on execution:
urlopen error [Errno 99] Cannot assign requested address
I read here that it may be due to a socket already being bound. Is that the case? Any suggestions on how to fix this?
Here is an answer to a similar looking question that I prepared earlier.... much earlier...
Socket in use error when reusing sockets
The error is different, but the underlying problem is probably the same: you are consuming all available ports and trying to reuse them before the TIME_WAIT state has ended.
[EDIT: in response to comments]
If it is within the capability/spec for your application, one obvious strategy is to control the rate of connections to avoid this situation.
Alternatively, you could use the httplib module. httplib.HTTPConnection() lets you specify a source_address tuple with which you can specify the port from which to make the connection, e.g. this will connect to localhost:1234 from localhost:9999:
import httplib
conn = httplib.HTTPConnection('localhost:1234', source_address=('localhost',9999))
conn.request('GET', '/index.html')
Then it is a matter of managing the source port assignment as described in my earlier answer. If you are on Windows you can use this method to get around the default range of ports 1024-5000.
There is (of course), an upper limit to how many connections you are going to be able to make and it is questionable what sort of an application would require making thousands of connections in rapid succession.
As mhawke suggested, TIME_WAIT is the most likely issue. The system-wide fix for your situation is to adjust kernel parameters so that such connections are cleaned up more often. Two options:
$ sysctl net.ipv4.tcp_tw_recycle=1
This lets the kernel reuse connections in the TIME_WAIT state. It may cause issues with NAT setups. Another option is:
$ sysctl net.ipv4.tcp_max_orphans=8192
$ sysctl net.ipv4.tcp_orphan_retries=1
This tells the kernel to keep at most 8192 connections not attached to any user process, and to retry only once before killing a TCP connection.
Note that these are not permanent changes. Add the settings to /etc/sysctl.conf to make them permanent.
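For example, to persist the orphan settings above, append them to /etc/sysctl.conf:
net.ipv4.tcp_max_orphans = 8192
net.ipv4.tcp_orphan_retries = 1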
http://code.google.com/p/lusca-cache/issues/detail?id=89#c4
http://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.kernel.obscure.html
I had a similar issue, but I was using POST with Python's requests library.
To make it worse, I used multiprocessing on top of each executor to post to the server, so thousands of connections were created in seconds, each taking a few seconds to leave the TIME_WAIT state and release its port for the next set of connections.
Out of all the solutions available on the internet that speak of disabling keep-alive, using requests.Session(), etc., I found this answer to work; it uses a 'Connection': 'close' header. You may need to put the header content in a separate line outside the post command though.
headers = {
    'Connection': 'close'
}
with requests.Session() as session:
    response = session.post('https://xx.xxx.xxx.x/xxxxxx/x', headers=headers, files=files, verify=False)
    results = response.json()
    print(results)
Just give it a try with the requests library.

How to speed up an HTTP request

I need to get json data and I'm using urllib2:
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'gzip')
opener = urllib2.build_opener()
connection = opener.open(request)
data = connection.read()
but although the data isn't very big, it is too slow.
Is there a way to speed it up? I can use third-party libraries too.
Accept-Encoding: gzip means that the client is ready to accept gzip-encoded content if the server is willing to send it. The rest of the request goes down the socket, through your operating system's TCP/IP stack, and onto the physical layer.
If the server supports ETags, then you can send an If-None-Match header to ensure that content has not changed and rely on the cache. An example is given here.
You cannot do much on the client side alone to improve your HTTP request speed.
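For illustration, a conditional-GET sketch with urllib2 (a sketch only: url, etag, and cached_data are assumed to come from earlier requests; urllib2 surfaces the 304 status as an HTTPError):
import urllib2

request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'gzip')
request.add_header('If-None-Match', etag)  # etag saved from an earlier response
try:
    connection = urllib2.build_opener().open(request)
    data = connection.read()                # fresh content
    etag = connection.headers.get('ETag')   # remember for next time
except urllib2.HTTPError as e:
    if e.code == 304:
        data = cached_data                  # Not Modified: reuse the cached copy
    else:
        raise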
You're dependent on a number of different things here that may not be within your control:
Latency/bandwidth of your connection
Latency/bandwidth of the server connection
Load of the server application and its individual processes
Items 2 and 3 are probably where the problem lies, and you won't be able to do much about them. Is the content cacheable? This will depend on your own application needs and the HTTP headers (e.g. ETags, Cache-Control, Last-Modified) returned from the server. The server may only update once a day, in which case you might be better off requesting the data only every hour.
There is unlikely to be an issue with urllib. If you have network issues and performance problems, consider using tools like Wireshark to investigate at the network level. I have very strong doubts that this is related to Python in any way.
If you are making lots of requests, look into threading. Having about 10 workers making requests can speed things up; you don't grind to a halt if one of them takes too long getting a connection.
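For illustration, a minimal worker-pool sketch in that spirit (the URLs are placeholders; it uses the same urllib2 as the question, and I/O releases the GIL so the threads overlap):
import threading
import urllib2

urls = ['http://example.com/data%d.json' % i for i in range(20)]  # placeholders
results = {}

def worker(my_urls):
    for u in my_urls:
        results[u] = urllib2.urlopen(u).read()

# spread the URLs across ~10 worker threads
threads = [threading.Thread(target=worker, args=(urls[i::10],)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()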
