I'm working on a simple proxy in Python that takes an HTTP GET request from a browser, queries the correct website, and returns the data (HTML, CSS, photos) to the client. I have it working, but it takes an exorbitant amount of time to read the data back from the external web server and send it back to the client. Below is (what I think is) the relevant code:
tempSocket.send(requestToWebpage)
tempList = []
while 1:
    print "waiting for data from website..."
    data = tempSocket.recv(bufferSize)
    if not data:
        break
    else:
        tempList.append(data)
tempResponse = ''.join(tempList)
print "closing temp socket..."
tempSocket.close()
splitResponse = tempResponse.partition("\r\n")
response = splitResponse[0] + "\r\n" + "Proxy-connection: close\r\n" + splitResponse[2]
print "sending results back..."
newConnection.send(response)
newConnection.close()
The proxy is running on my own machine (as is the client browser), which is Windows 7 64-bit. I have a decent wireless connection to the internet. Currently it takes upwards of several minutes to receive the results of each GET request and transmit them to the client. By watching the print statements, I've noticed that most of the time seems to be spent in the while loop (especially the last pass through it), but the other print messages also take far longer to appear than it seems like they should.
Any ideas on what is going on and suggestions to improve the speed?
Marcus's comment is probably right. The remote server is not closing its connection.
You might be asking for this behaviour, perhaps without even realising it. What is in the request to the server, i.e. what is being sent in requestToWebpage? Are you setting a Connection: Keep-Alive header?
Keep-Alive is the default if you are using HTTP 1.1 in the request.
If it is not because of Keep-Alive, you may need to get the Content-Length from the reply and then you'll know how many bytes to read.
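As a rough illustration of that Content-Length approach, here is a minimal Python 2-style sketch (not the asker's code; it assumes the server sends a Content-Length header, no chunked encoding, and byte strings as in Python 2):

def receive_http_response(sock, bufferSize=4096):
    # Read until the blank line that ends the headers, then read exactly
    # Content-Length more body bytes instead of waiting for the server to close.
    received = ''
    while '\r\n\r\n' not in received:
        chunk = sock.recv(bufferSize)
        if not chunk:
            return received                      # server closed early
        received += chunk
    headers, _, body = received.partition('\r\n\r\n')
    content_length = 0
    for line in headers.split('\r\n')[1:]:       # skip the status line
        name, _, value = line.partition(':')
        if name.strip().lower() == 'content-length':
            content_length = int(value.strip())
    while len(body) < content_length:
        chunk = sock.recv(bufferSize)
        if not chunk:
            break
        body += chunk
    return headers + '\r\n\r\n' + body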
Related
I have this simple code, which connects to an external server. I call this function hundreds of times a minute, and after a while I get a "system lacked sufficient buffer" exception. When I view the connections using TCPView, it shows hundreds of connections to the external server in TIME_WAIT status.
Why is this happening?
Is the Python requests module not suitable if I have to send hundreds of requests, and if so, what should I do?
def sendGetRequest(self, url, payload):
    success = True
    url = self.generateUrl(url)
    result = requests.get(url, params=urllib.parse.urlencode(payload))
    code = result.status_code
    text = result.text
    if code < 200 or code >= 300:
        success = False
    result.close()
    return success, code, text
You are closing a lot of connections you opened with requests at the client side, where the server expected them to be re-used instead.
Because HTTP runs on top of TCP, a bidirectional protocol, closing a socket on the client side means the socket can't fully close until the other end (the server end) acknowledges that the connection has been closed properly. Until that acknowledgement has been exchanged with the server (or until a timeout, set to 2x the maximum segment lifetime, is reached), the socket remains in the TIME_WAIT state. In HTTP, closing normally happens on the server side, after a response has been completed; it is the server that'll wait for your client to acknowledge closure.
You see a lot of these on your side, because each new connection must use a new local port number. A server doesn't see nearly the same issues because it uses a fixed port number for the incoming requests, and that single port number can accept more connections even though there may be any number of outstanding TIME_WAIT connection states. A lot of local outgoing ports in TIME_WAIT on the other hand means you'll eventually run out of local ports to connect from.
This is not unique to Python or to requests.
What you instead should do is minimize the number of connections and minimize closing. Modern HTTP servers expect you to be reusing connections for multiple requests. You want to use a requests.Session() object, so it can manage connections for you, and then do not close the connections yourself.
You can also drastically simplify your function by using standard requests functionality; params already handles url encoding, for example, and comparisons already give you a boolean value you could assign directly to success:
session = requests.Session()

def sendGetRequest(self, url, payload):
    result = session.get(self.generateUrl(url), params=payload)
    success = 200 <= result.status_code < 300
    return success, result.status_code, result.text
Note that 3xx redirect responses are already followed automatically by requests, so you could just use result.ok:
def sendGetRequest(self, url, payload):
    result = session.get(self.generateUrl(url), params=payload)
    return result.ok, result.status_code, result.text
Next, you may want to consider using asyncio coroutines (and aiohttp, still using sessions) to make all those check requests. That way your code doesn't have to sit idle while each request-response round trip completes, but can be doing something else in the intervening period. I've built applications that handle thousands of concurrent HTTP requests at a time without breaking a sweat, all the while doing lots of meaningful work while slow network I/O operations are completing.
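For illustration only, a minimal asyncio/aiohttp sketch of that idea (the URL list and the 'q' parameter are hypothetical, not from the question):

import asyncio
import aiohttp

async def fetch(session, url, payload):
    # aiohttp builds the query string from a params dict, much like requests does
    async with session.get(url, params=payload) as resp:
        text = await resp.text()
        return 200 <= resp.status < 300, resp.status, text

async def main(urls):
    # one shared session (and connection pool) for every request
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url, {'q': 'example'}) for url in urls]
        return await asyncio.gather(*tasks)

results = asyncio.run(main(['https://example.org/a', 'https://example.org/b']))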
while True:
    data = resp.read(65536)
    if not data:
        break
    yield data
Actually I'm not asking for code, but about how the whole HTTP connection works in principle.
If I stop the program at one yield (while debugging, for instance), where is the rest of my HTTP response data? Is it still on the server, or already in my client machine's memory?
If it is the former, what does the program on the web server do to keep the data from being flushed to the client all at once? Does it control the stream through TCP sequencing?
First of all, it depends on your framework. Normally, chunked HTTP transfer encoding is used for yielded responses, so only the data that has been read is sent to the client; no data is buffered on the server side.
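As a framework-specific illustration (Flask is just an example here, not something from the question), a yielded response might look like the sketch below; each chunk is handed to the WSGI server as soon as it is yielded, and with no Content-Length set it goes out with chunked transfer encoding:

from flask import Flask, Response

app = Flask(__name__)

def generate():
    # each yielded piece is sent on as produced, nothing is accumulated first
    for i in range(5):
        yield 'chunk %d\n' % i

@app.route('/stream')
def stream():
    return Response(generate(), mimetype='text/plain')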
I think it depends on the length of your data. If the data is short, the client reads it in one go and gets it all, so if you stop the program the data is already in the client's memory. Otherwise, if the data is too long to read at once, the remainder may still be on the server side; if you stop the client program at that point, the rest of the data is not in your program's memory.
As my previous question unfortunately got closed for being an "exact copy" of another question, while it definitely IS NOT, here it is again.
It is NOT a duplicate of Python: HTTP Post a large file with streaming
That one deals with streaming a big file; I want to send arbitrary chunks of a file one by one to the same http connection. So I have a file of say 20 MB, and what I want to do is open an HTTP connection, then send 1 MB, send another 1 MB, etc, until it's complete. Using the same connection, so the server sees a 20 MB chunk appear over that connection.
Mmapping the file is something I ALSO intend to do, but that does not work when the data is read from stdin, and it is primarily for that second case that I am looking for this part-by-part feeding of data.
Honestly, I wonder whether it can be done at all; if not, I'd like to know, and then I can close the issue. But if it can be done, how could it be done?
From the client’s perspective, it’s easy. You can use httplib’s low-level interface—putrequest, putheader, endheaders, and send—to send whatever you want to the server in chunks of any size.
But you also need to indicate where your file ends.
If you know the total size of the file in advance, you can simply include the Content-Length header, and the server will stop reading your request body after that many bytes. The code may then look like this.
import httplib
import os.path

total_size = os.path.getsize('/path/to/file')
infile = open('/path/to/file', 'rb')   # binary mode, so the byte count matches Content-Length

conn = httplib.HTTPConnection('example.org')
conn.connect()
conn.putrequest('POST', '/upload/')
conn.putheader('Content-Type', 'application/octet-stream')
conn.putheader('Content-Length', str(total_size))
conn.endheaders()

while True:
    chunk = infile.read(1024)
    if not chunk:
        break
    conn.send(chunk)

resp = conn.getresponse()
If you don’t know the total size in advance, the theoretical answer is the chunked transfer encoding. Problem is, while it is widely used for responses, it seems less popular (although just as well defined) for requests. Stock HTTP servers may not be able to handle it out of the box. But if the server is under your control too, you could try manually parsing the chunks from the request body and reassembling them into the original file.
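As a rough sketch of what manually chunk-encoding the request body could look like on the client (assuming the server under your control actually decodes Transfer-Encoding: chunked request bodies; the data pieces here are hypothetical):

import httplib

conn = httplib.HTTPConnection('example.org')
conn.connect()
conn.putrequest('POST', '/upload/')
conn.putheader('Content-Type', 'application/octet-stream')
conn.putheader('Transfer-Encoding', 'chunked')
conn.endheaders()

def send_chunk(conn, data):
    # each chunk: hexadecimal length, CRLF, the data itself, CRLF
    conn.send('%x\r\n%s\r\n' % (len(data), data))

for piece in ('first part', 'second part'):   # hypothetical chunks
    send_chunk(conn, piece)
conn.send('0\r\n\r\n')                         # terminating zero-length chunk

resp = conn.getresponse()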
Another option is to send each chunk as a separate request (with Content-Length) over the same connection. But you still need to implement custom logic on the server. Moreover, you need to persist state between requests.
Added 2012-12-27. There’s an nginx module that converts chunked requests into regular ones. May be helpful so long as you don’t need true streaming (start handling the request before the client is done sending it).
I tried using a socket for 2 sends. The first one succeeds and the next one does not.
From http://docs.python.org/howto/sockets.html
it would appear that multiple sends should be allowed. For better or worse, I don't really need to read from the socket.
I have used Twisted, but for the present purpose I would like to stick to a plain socket if I can help it (partly because I am using it within an application that already uses Twisted to communicate; this is a separate connection).
"When the connect completes, the socket s can be used to send in a request for the text of the page. The same socket will read the reply, and then be destroyed. That’s right, destroyed. Client sockets are normally only used for one exchange (or a small set of sequential exchanges)."
return value for the send that succeeds = 35
return value for the send that FAILS = 32
Code, with some minor editing to remove any business logic:
self._commandSock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

def sendPrereqs(self, id, prereqs):
    self._commandSock.connect(self._commandConnection)
    # parse prereqs
    temp = prereqs.split(',')
    for pair in temp:
        tup = pair.partition(':')
        try:
            command = 'some command'
            logging.info('sending command: ' + command)
            ret = self._commandSock.send(command)
            if ret == None:
                logging.info('send called successfully: ' + command)
            else:
                logging.info('socket returned non-None: ' + str(ret))
        except:
            print 'Unexpected Exception ', sys.exc_info()[0]()
            print sys.exc_info()
            #logging.info('Unexpected Exception '+ str(sys.exc_info()[0]()))
            #logging.info(' ' + str(sys.exc_info()))
    self._commandSock.close()
return value for the send that succeeds = 35 return value for the send that FAILS = 32
Documentation says that successful send should return None.
No it doesn't. Documentation says:
Returns the number of bytes sent. Applications are responsible for checking that all data has been sent; if only some of the data was transmitted, the application needs to attempt delivery of the remaining data. For further information on this concept, consult the Socket Programming HOWTO.
You still haven't explained what you mean by "FAILS". The send call is returning successfully, and it's almost certainly placed 32 bytes into the socket write buffer.
If the only reason you think it's failing is that it returns the correct value, then the answer is obviously that it's not failing.
If something else is going wrong, there are all kinds of things that could be wrong at a higher level. One likely one is this: The server (especially if it was coded by someone who doesn't understand sockets well) is coded to expect one recv() of 35 bytes, and one recv() of 32 bytes. But it's actually getting a single recv() of 67 bytes, and waiting forever for the second, which never comes. There is no rule guaranteeing that each send() on a stream (TCP) socket corresponds to one recv() on the other side. You need to create some kind of stream protocol that demarcates the separate messages.
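One common way to demarcate messages (purely an illustration, not something from the question) is a length prefix in front of every message. A minimal Python 2-style sketch:

import struct

def send_message(sock, payload):
    # 4-byte big-endian length prefix, then the payload itself
    sock.sendall(struct.pack('!I', len(payload)) + payload)

def recv_exactly(sock, n):
    data = ''
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise EOFError('connection closed mid-message')
        data += chunk
    return data

def recv_message(sock):
    (length,) = struct.unpack('!I', recv_exactly(sock, 4))
    return recv_exactly(sock, length)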
Meanwhile, the quote you're referring to is irrelevant. It's describing how client sockets are used by simple web browsers: They make a connection, do one send, receive all the data, then destroy the connection. I can see why it misled you, and you may want to file a documentation bug to get it improved. But for now, just ignore it.
If you want to make absolutely sure that the client is sending the right data, and the problem is in the server, there are two easy ways to do that:
Use netcat as a trivial substitute server (e.g., nc -kl 6000, replacing the "6000" with the actual port) and make sure it logs what you think the server should be seeing.
Use Wireshark to watch the connection between the client and server.
Once you've verified that the problem is on the server side, you need to debug the server. If you need help with that, that's probably best done in a new question, where you can post the server code instead of the client, and explain (with a link here) that you're sure the client is sending the right information.
The documentation is only referring to a common scenario. You can call send, sendall, and sendto on all sockets as often as you want - as long as the socket is not closed.
Note that these methods return the number of bytes sent; 35 and 32 simply mean you sent 35 bytes with the first call and 32 bytes with the second.
The fact that socket.send returns without an exception means that the data got handed to the operating system, but not that it actually reached the endpoint (or has been read correctly by an application there).
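For completeness, a minimal sketch of the loop the documentation is describing, i.e. re-sending any bytes that a partial send() left behind (socket.sendall does essentially this for you):

def send_all(sock, data):
    # keep calling send() until every byte has been handed to the OS
    total_sent = 0
    while total_sent < len(data):
        sent = sock.send(data[total_sent:])
        if sent == 0:
            raise RuntimeError('socket connection broken')
        total_sent += sent
    return total_sent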
I need to get json data and I'm using urllib2:
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'gzip')
opener = urllib2.build_opener()
connection = opener.open(request)
data = connection.read()
but although the data isn't very big, it is too slow.
Is there a way to speed it up? I can use 3rd party libraries too.
Accept-Encoding: gzip means that the client is ready to accept gzip-encoded content if the server is willing to send it. The rest of the request goes down the socket, through your operating system's TCP/IP stack, and onto the physical layer.
If the server supports ETags, then you can send an If-None-Match header to ensure that the content has not changed and rely on the cache. An example is given here.
You cannot do much with clients only to improve your HTTP request speed.
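Still, as a rough urllib2 sketch of the conditional-GET (If-None-Match) idea mentioned above: the URL, the ETag string, and the cached_data variable are all hypothetical, and the ETag would come from a previous response's ETag header.

import urllib2

url = 'http://example.org/data.json'               # hypothetical URL
request = urllib2.Request(url)
request.add_header('If-None-Match', '"686897696a7c876b7e"')   # ETag saved earlier
try:
    connection = urllib2.urlopen(request)
    data = connection.read()                        # content changed, new body returned
except urllib2.HTTPError as e:
    if e.code == 304:
        data = cached_data                          # not modified, reuse the local copy
    else:
        raise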
You're dependent on a number of different things here that may not be within your control:
Latency/Bandwidth of your connection
Latency/Bandwidth of server connection
Load of server application and its individual processes
Items 2 and 3 are probably where the problem lies, and you won't be able to do much about them. Is the content cacheable? This will depend on your own application needs and the HTTP headers (e.g. ETags, Cache-Control, Last-Modified) that are returned from the server. The server may only update once a day, in which case you might be better off only requesting data every hour.
There is unlikely to be an issue with urllib. If you have network issues and performance problems, consider using tools like Wireshark to investigate at the network level. I have very strong doubts that this is related to Python in any way.
If you are making lots of requests, look into threading. Having about 10 workers making requests can speed things up - you don't grind to a halt if one of them takes too long getting a connection.
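A small worker-pool sketch of that idea, written in Python 2 style to match the urllib2 code above (the URL list and the worker count of 10 are hypothetical):

import urllib2
from threading import Thread
from Queue import Queue

def worker(q, results):
    while True:
        url = q.get()
        try:
            results.append(urllib2.urlopen(url).read())
        finally:
            q.task_done()

q = Queue()
results = []
for _ in range(10):                       # roughly 10 workers, as suggested above
    t = Thread(target=worker, args=(q, results))
    t.daemon = True
    t.start()

for url in ['http://example.org/a.json', 'http://example.org/b.json']:
    q.put(url)
q.join()                                  # wait until every queued URL has been fetched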