Why does a Python HTTP request create TIME_WAIT connections? - python

I have this simple code, which connects to an external server. I call this function hundreds of times a minute, and after a while I get an exception saying the system lacked sufficient buffer space. When I view the connections with TCPView, it shows hundreds of connections to the external server in the TIME_WAIT state.
Why is this happening?
Is the Python requests module not suitable for sending hundreds of requests? If so, what should I do instead?
def sendGetRequest(self, url, payload):
    success = True
    url = self.generateUrl(url)
    result = requests.get(url, params=urllib.parse.urlencode(payload))
    code = result.status_code
    text = result.text
    if code < 200 or code >= 300:
        success = False
    result.close()
    return success, code, text

You are closing a lot of connections that you opened with requests on the client side, where the server expected them to be re-used instead.
Because HTTP runs on top of TCP, a bidirectional protocol, closing a socket on the client side doesn't make it disappear immediately: the client and server still exchange the TCP close handshake, and afterwards the client socket lingers in the TIME_WAIT state for a timeout of twice the maximum segment lifetime, so that any stray packets from the old connection can expire. In HTTP, closing normally happens on the server side, after a response has been completed; it is then the server that ends up holding the TIME_WAIT state.
You see a lot of these on your side, because each new connection must use a new local port number. A server doesn't see nearly the same issues because it uses a fixed port number for the incoming requests, and that single port number can accept more connections even though there may be any number of outstanding TIME_WAIT connection states. A lot of local outgoing ports in TIME_WAIT on the other hand means you'll eventually run out of local ports to connect from.
This is not unique to Python or to requests.
What you instead should do is minimize the number of connections and minimize closing. Modern HTTP servers expect you to be reusing connections for multiple requests. You want to use a requests.Session() object, so it can manage connections for you, and then do not close the connections yourself.
You can also drastically simplify your function by using standard requests functionality; params already handles url encoding, for example, and comparisons already give you a boolean value you could assign directly to success:
session = requests.Session()

def sendGetRequest(self, url, payload):
    result = session.get(self.generateUrl(url), params=payload)
    success = 200 <= result.status_code < 300
    return success, result.status_code, result.text
Note that redirects (3xx status codes) are already followed automatically, so you could just use result.ok:
def sendGetRequest(self, url, payload):
    result = session.get(self.generateUrl(url), params=payload)
    return result.ok, result.status_code, result.text
Next, you may want to consider using asyncio coroutines (and aiohttp, still using sessions) to make all those check requests. That way your code doesn't have to sit idle waiting for each request-response round trip to complete, but can be doing something else in the intervening period. I've built applications that handle thousands of concurrent HTTP requests at a time without breaking a sweat, all the while doing lots of meaningful work while slow network I/O operations complete.
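A minimal sketch of that approach, assuming aiohttp is installed; the URL and payload below are placeholders rather than anything from your code:

import asyncio
import aiohttp

async def fetch(session, url, payload):
    # One GET per coroutine; the shared session reuses pooled connections.
    async with session.get(url, params=payload) as resp:
        return 200 <= resp.status < 300, resp.status, await resp.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url, {"q": "check"}) for url in urls]
        return await asyncio.gather(*tasks)

results = asyncio.run(main(["https://example.com/ping"] * 100))

All 100 requests here share a single session, so connections are pooled and reused rather than opened and closed for every request.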

Related

Does setting socket timeout cancel the initial request

I have a request that can only run once. At times, the request takes much longer than it should.
If I were to set a default socket timeout value (using socket.setdefaulttimeout(5)), and it took longer than 5 seconds, will the original request be cancelled so it's safe to retry (see example code below)?
If not, what is the best way to cancel the original request and retry it, while ensuring it never runs more than once?
import socket
from googleapiclient.discovery import build
from tenacity import retry, stop_after_attempt, wait_fixed, retry_if_exception_type

@retry(
    retry=retry_if_exception_type(socket.timeout),
    wait=wait_fixed(4),
    stop=stop_after_attempt(3)
)
def create_file_once_only(creds, body):
    service = build('drive', 'v3', credentials=creds)
    file = service.files().create(body=body, fields='id').execute()

socket.setdefaulttimeout(5)
create_file_once_only(creds, body)
It's unlikely that this can be made to work as you hope. An HTTP POST (as with any other HTTP request) is implemented by sending a command to the web server, then receiving a response. The python requests library encapsulates a lot of tedious parts of that for you, but at the core, it's going to do a socket send followed by a socket recv (it may of course require more than one send or recv depending on the size of the data).
Now, if you were able to connect to the web server initially (again, this is taken care of for you by the requests library but typically only takes a few milliseconds), then it's highly likely that the data in your POST request has long since been sent. (If the data you are sending is megabytes long, it's possible that it's only been partially sent, but if it is reasonably short, it's almost certainly been sent in full.)
That in turn means that in all likelihood the server has received your entire request and is working on it or has enqueued your request to work on it eventually. In either case, even if you break the connection to the server by timing out on the recv, it's unlikely that the server will actually even notice that until it gets to the point in its execution where it would be sending its response to your request. By that point, it has probably finished doing whatever it was going to do.
In other words, your socket timeout is not going to apply to the "HTTP request" -- it applies to the underlying socket operations instead -- and almost certainly to the recv part on the tail end. And just breaking the socket connection doesn't cancel the HTTP request.
There is no reliable way to do what you want without designing a transactional protocol with the close cooperation of the HTTP server.
You could, still with the cooperation of the HTTP server, do something that approximates it:
1. Create a unique ID (a UUID or the like).
2. Send a request to the server that contains that UUID along with the other account info (name, password, whatever else).
3. The server then only creates the account if it hasn't already created an account with the same unique ID.
That way, you can request the operation multiple times, but know that it will only actually be implemented once. If asked to do the same operation a second time, the server would simply respond with "yep, already did that".
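A sketch of that idea, using requests for illustration; the /accounts endpoint and the request_id field are hypothetical and would have to be agreed with the server:

import uuid
import requests

def create_account_once(session, base_url, account_data):
    # One UUID per logical operation; resending the same payload is safe
    # because the (hypothetical) server ignores duplicate request_ids.
    payload = dict(account_data, request_id=str(uuid.uuid4()))
    for _ in range(3):
        try:
            return session.post(base_url + "/accounts", json=payload, timeout=5)
        except requests.exceptions.Timeout:
            continue
    raise RuntimeError("request did not complete after retries")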

python requests keep connection alive for indefinite time

I'm trying to get a python script running which calls an external API (to which I only have read-access) in a certain interval, the API uses cookie-based authentication: Calling the /auth endpoint initially sets session cookies which are then used for authentication in further requests.
As for my problem: because the authentication is based on an active session, the cookies aren't valid once the connection drops, and the session therefore has to be restarted. From what I've read, requests is based on urllib3, which keeps the connection alive by default. Yet, after a few tests I noticed that under some circumstances the connection is dropped anyway.
I used a Session object from the requests module and I've tested how long it takes for the connection to be dropped as follows:
from requests import Session
import logging
from time import time, sleep

logging.basicConfig(level=logging.DEBUG)

def tt(interval):
    credentials = {"username": "user", "password": "pass"}
    s = Session()
    r = s.post("https://<host>:<port>/auth", json=credentials)
    ts = time()
    while r.status_code == 200:
        r = s.get("https://<host>:<port>/some/other/endpoint")
        sleep(interval)
    return time() - ts  # Seconds until connection drop
It might not be the best way to find that out, but I let that function run twice: once with an interval of 1 second and once with an interval of 1 minute. Both ran for about an hour before I had to stop the execution manually.
However, when I swapped the two lines within the while loop, which meant that there was a 1-minute-delay after the initial POST /auth request, the following GET request failed with a 401 Unauthorized and this message being logged beforehand:
DEBUG:urllib3.connectionpool:Resetting dropped connection: <host>
As the interval between requests may range from a few minutes to multiple hours in my production script, I need to know beforehand how long these sessions are kept alive and whether there are exceptions to that rule (like the connection being dropped when no request follows the initial POST /auth for a short while).
So, how long does requests or rather urllib3 keep the connection alive, and is it possible to extend that time indefinitely?
Or is it the server instead of requests that drops the connection?
By using requests.Session, keep-alive is handled for you automatically.
In the first version of your loop, which continuously polls the server after the /auth call is made, the server does not drop the connection because of the GET requests that keep arriving. In the second version, it's likely that the sleep interval exceeds the amount of time the server is configured to keep the connection open.
Depending on the server configuration of the API, the response headers may include a Keep-Alive header with information about how long connections are kept open at a minimum. HTTP/1.0 specifies this information is included in the timeout parameter of the Keep-Alive header. You could use this information to determine how long you have until the server will drop the connection.
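A small sketch of reading that value from a requests response, assuming the header looks like Keep-Alive: timeout=15, max=100:

def keep_alive_timeout(response, default=None):
    # Parse the timeout parameter of the Keep-Alive header, if the server sends one.
    header = response.headers.get("Keep-Alive", "")
    for part in header.split(","):
        name, _, value = part.strip().partition("=")
        if name.lower() == "timeout" and value.strip().isdigit():
            return int(value)
    return default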
In HTTP/1.1, persistent connections are used by default and the Keep-Alive header is not used unless the server explicitly implements it for backwards compatibility. Due to this difference, there isn't an immediate way for a client to determine the exact timeout for connections since it may exist solely as server side configuration.
The key to keeping the connection open would be to continue polling at regular intervals. The interval you use must be less than the server's configured connection timeout.
One other thing to point out is that artificially extending the length of the session indefinitely this way makes one more vulnerable to session fixation attacks. You may want to consider adding logic that occasionally reestablishes the session to minimize risk of these types of attacks.
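A sketch of that kind of logic, reusing the placeholders from the question: instead of assuming the connection never drops, re-authenticate whenever the server rejects the session.

def get_with_reauth(s, credentials, url):
    r = s.get(url)
    if r.status_code == 401:  # session cookies no longer valid
        s.post("https://<host>:<port>/auth", json=credentials)
        r = s.get(url)        # retry once with a fresh session
    return r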

using socket.shutdown(1) is preventing web server from sending responses

I'm making a proxy which sits between the browser and the web. There's a snippet of code I can't seem to get to work.
#send request to web server
web_client.send(request)
#signal client is done with sending
web_client.shutdown(1)
If I use shutdown(1), the proxy has a great improvement in performance and speed.
However, some web servers do not send responses if I use shutdown. Console output:
request sent to host wix.com
got response packet of len 0
got response packet of len 0
breaking loop
and the browser displays
The connection was reset
The connection to the server was reset while the page was loading.
However, if I remove shutdown(1), there are no problems of sort. Console output:
got response packet of len 1388
got response packet of len 1388
got response packet of len 1388
got response packet of len 989
got response packet of len 0
got response packet of len 0
breaking loop
and the browser normally displays the website.
Why is this happening? This is only happening on certain hosts.
From https://docs.python.org/2/library/socket.html#socket.socket.shutdown:
Depending on the platform, shutting down one half of the connection can also close the opposite half (e.g. on Mac OS X, shutdown(SHUT_WR) does not allow further reads on the other end of the connection)
This may not be the problem because you say that only some web servers are affected, but is your proxy running on Mac OS X?
The TCP/IP stack will do a graceful connection close only if there is no data still pending to be sent on the socket. Completion of send only indicates that the data has been pushed into the kernel buffer and is ready for sending. Here, shutdown is invoked immediately after send, while some of that data may still be pending in the TCP stack, so the stack sends a reset to the other end because it decides the application doesn't wish to finish sending. To do a graceful connection close, invoke select on the socket and wait for it to become writable, which means all the data has been pushed out of the stack; then invoke shutdown and close the socket.
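A minimal sketch of that sequence; the names are illustrative rather than taken from the question's proxy code:

import select
import socket

def send_then_shutdown(sock, request_bytes):
    sock.sendall(request_bytes)      # hand the whole request to the kernel
    select.select([], [sock], [])    # wait until the socket reports writable again
    sock.shutdown(socket.SHUT_WR)    # then signal end-of-request to the server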

Python HTTP client with request pipelining

The problem: I need to send many HTTP requests to a server. I can only use one connection (non-negotiable server limit). The server's response time plus the network latency is too high – I'm falling behind.
The requests typically don't change server state and don't depend on the previous request's response. So my idea is to simply send them on top of each other, enqueue the response objects, and depend on the Content-Length: of the incoming responses to feed incoming replies to the next-waiting response object. In other words: Pipeline the requests to the server.
This is of course not entirely safe (any reply without Content-Length: means trouble), but I don't care -- in that case I can always retry any queued requests. (The safe way would be to wait for the header before sending the next bit. That might help me enough. There's no way to test beforehand.)
So, ideally I want the following client code (which uses client delays to mimic network latency) to run in three seconds.
Now for the $64000 question: Is there a Python library which already does this, or do I need to roll my own? My code uses gevent; I could use Twisted if necessary, but Twisted's standard connection pool does not support pipelined requests. I also could write a wrapper for some C library if necessary, but I'd prefer native code.
#!/usr/bin/python

import gevent.pool
from gevent import sleep
from time import time
from geventhttpclient import HTTPClient

url = 'http://local_server/100k_of_lorem_ipsum.txt'
http = HTTPClient.from_url(url, concurrency=1)

def get_it(http):
    print time(), "Queueing request"
    response = http.get(url)
    print time(), "Expect header data"
    # Do something with the header, just to make sure that it has arrived
    # (the greenlet should block until then)
    assert response.status_code == 200
    assert response["content-length"] > 0
    for h in response.items():
        pass
    print time(), "Wait before reading body data"
    # Now I can read the body. The library should send at
    # least one new HTTP request during this time.
    sleep(2)
    print time(), "Reading body data"
    while response.read(10000):
        pass
    print time(), "Processing my response"
    # The next request should definitely be transmitted NOW.
    sleep(1)
    print time(), "Done"

# Run parallel requests
pool = gevent.pool.Pool(3)
for i in range(3):
    pool.spawn(get_it, http)
pool.join()
http.close()
Dugong is an HTTP/1.1-only client which claims to support real HTTP/1.1 pipelining. The tutorial includes several examples on how to use it, including one using threads and another using asyncio.
Be sure to verify that the server you're communicating with actually supports HTTP/1.1 pipelining—some servers claim to support HTTP/1.1 but don't implement pipelining.
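A rough sketch of what pipelining with Dugong looks like, going by its tutorial; treat the exact method names (send_request, read_response, readall, disconnect) as assumptions and check the library's documentation:

from dugong import HTTPConnection

paths = ['/100k_of_lorem_ipsum.txt'] * 3
conn = HTTPConnection('local_server')
for path in paths:                 # queue several requests on the single connection
    conn.send_request('GET', path)
for path in paths:                 # then read the responses back in the same order
    resp = conn.read_response()
    assert resp.status == 200
    body = conn.readall()
conn.disconnect()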
I think txrequests could get you most of what you are looking for, using the background_callback to enqueue processing of responses on a separate thread. Each request would still be its own thread, but using a session means that by default it would reuse the same connection.
https://github.com/tardyp/txrequests#working-in-the-background
It seems you are running Python 2.
For Python 3.5+ you could use an async/await event loop; see asyncio. There is also Trio, an alternative async I/O library available on pip that aims to be easier to use.
Another thing I can think of is multiple threads with locks, though I'd have to think more about how to explain that properly, or whether it would even work here.

Python TCP socket proxy extremely slow

I'm working on a simple proxy in Python that takes an HTTP GET request from a browser, queries the correct website, and returns the data (HTML, CSS, photos) to the client. I have it working, but it takes an exorbitant amount of time to read the data back from the external web server and send it back to the client. Below is (what I think is) the relevant code:
tempSocket.send(requestToWebpage)

tempList = []
while 1:
    print "waiting for data from website..."
    data = tempSocket.recv(bufferSize)
    if not data:
        break
    else:
        tempList.append(data)
tempResponse = ''.join(tempList)

print "closing temp socket..."
tempSocket.close()

splitResponse = tempResponse.partition("\r\n")
response = splitResponse[0] + "\r\n" + "Proxy-connection: close\r\n" + splitResponse[2]

print "sending results back..."
newConnection.send(response)
newConnection.close()
The proxy is running on my own machine (as is the client browser), which is Windows 7 64-bit, and I have a decent wireless connection to the internet. Currently it takes upwards of several minutes to receive the results of each GET request and transmit them to the client. By watching the print statements, I've noticed that most of the time seems to be spent in the while loop (especially the last pass through it), but the other print messages also take far longer to appear than it seems they should.
Any ideas on what is going on and suggestions to improve the speed?
Marcus's comment is probably right. The remote server is not closing its connection.
You might be asking for this behaviour, perhaps without even realising it. What is in the request to the server, i.e. what is being sent in requestToWebpage? Are you setting a Connection: Keep-Alive header?
Keep-Alive is the default if you are using HTTP 1.1 in the request.
If it is not because of Keep-Alive, you may need to get the Content-Length from the reply and then you'll know how many bytes to read.
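If Keep-Alive does turn out to be the cause, one sketch of a fix (assuming requestToWebpage holds the raw request text, as in your code) is to ask the upstream server to close the connection when it is done, so that recv() eventually returns an empty string:

# Strip any existing Connection header and request that the server close when done.
lines = requestToWebpage.split("\r\n")
lines = [l for l in lines if not l.lower().startswith("connection:")]
lines.insert(1, "Connection: close")   # right after the request line
requestToWebpage = "\r\n".join(lines)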
