I've been scouring the Internet looking for a solution to my problem with Python. I'm trying to use a urllib2 connection to read a potentially endless stream of data from an HTTP server. It's part of some interactive communication, so it's important that I can get the data that's available, even if it isn't a whole buffer full. There seems to be no way to have read/readline return the available data: they block forever, waiting for the entire (endless) stream before returning.
Even if I set the underlying file descriptor to non-blocking using fcntl, the urllib2 file-object still blocks! In general, there seems to be no way to make Python file-objects, upon read, return all available data if there is some and block otherwise.
I've seen a few posts from people seeking help with this, but no solutions. What gives? Am I missing something? This seems like such a normal use case to ruin completely! I'm hoping to use urllib2's ability to detect configured proxies and use chunked encoding, but I can't if it won't cooperate.
Edit: Upon request, here is some example code
Client:
connection = urllib2.urlopen(commandpath)
id = connection.readline()
Now suppose that the server is using chunked transfer encoding and writes one chunk down the stream, the chunk containing the line, and then waits. The connection is still open, but the client has data waiting in a buffer.
I cannot get read or readline to return the data I know is waiting, because they try to read until the end of the connection. In this case the connection may never close, so the call will wait either forever or until an inactivity timeout severs the connection. Once the connection is severed it will return, but that's obviously not the behavior I want.
urllib2 operates at the HTTP level, which works with complete documents. I don't think there's a way around that without hacking into the urllib2 source code.
What you can do is use plain sockets (you'll have to talk HTTP yourself in this case) and call sock.recv(maxbytes), which reads only the available data.
Update: you may want to try calling conn.fp._sock.recv(maxbytes) instead of conn.read(bytes) on a urllib2 connection.
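For illustration, a minimal sketch of that suggestion, assuming CPython 2.x, where conn.fp is a socket._fileobject and _sock is the raw socket underneath; these are private internals and may differ between versions:

import urllib2

conn = urllib2.urlopen(url)

# Private attributes: the raw socket under the buffered file-object.
# Any bytes already sitting in conn.fp's buffer will be skipped.
raw_sock = conn.fp._sock
raw_sock.settimeout(10.0)    # optional: bound the wait instead of blocking forever
data = raw_sock.recv(4096)   # returns as soon as *some* data is available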
I am writing a Python language plugin for an active code generator that makes calls to our REST API. After many failed attempts to use the requests library, I opted for the much lower-level socket and ssl modules, which have been working fine so far. I am using a very crude method to parse the responses; for fairly short response bodies this works fine, but I am now trying to retrieve much larger JSON objects (lists of users). The response is being cut off as follows (note: I removed a couple of user entries for the sake of brevity):
{"page-start":1,"total":5,"userlist":[{"userid":"jim.morrison","first-name":"Jim","last-name":"Morrison","language":"English","timezone":"(GMT+5:30)CHENNAI,KOLKATA,MUMBAI,NEW DELHI","currency":"US DOLLAR","roles":
There should be a few more users after this and the response body is on a single line in the console.
Here is the code I am using to request the user list from the REST API server:
import socket, ssl, json
host = self.WrmlClientSession.api_host
port = 8443
pem_file = "<pem file>"
url = self.WrmlClientSession.buildURI(host, port, '<root path>')
#Create the header
http_header = 'GET {0} HTTP/1.1\n\n'
req = http_header.format(url)
#Socket configuration and connection execution
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
conn = ssl.wrap_socket(sock, ca_certs = pem_file)
conn.connect((host, port))
conn.send(req)
response = conn.recv()
(headers, body) = response.split("\r\n\r\n")
#Here I would convert the body into a json object, but because the response is
#cut off, it cannot be properly decoded.
print(response)
Any insight into this matter would be greatly appreciated!
Edit: I forgot to mention that I debugged the response on the server-side, and everything was perfectly normal.
You can't assume that a single recv() call will get all the data, since the TCP connection only buffers a limited amount. Also, you're not parsing any of the headers to determine the size of the body you should expect. You could use a non-blocking socket and keep reading until it blocks, which will mostly work, but it's simply not reliable and quite poor practice, so I'm not going to document it here.
HTTP has ways of indicating the size of the body for exactly this reason, and the correct approach is to use them if you want your code to be reliable. There are two things to look for. First, if the HTTP response has a Content-Length header, it indicates how many bytes the response body will contain; you need to keep reading until you've received that many. The second option is that the server may send a response using chunked encoding; it indicates this with a Transfer-Encoding header whose value contains the text chunked. I won't go into chunked encoding here; read the Wikipedia article for details. In essence, the body contains a small header before each "chunk" of data indicating that chunk's size. In this case you have to keep reading chunks until you get an empty one, which marks the end of the response. This approach is used instead of Content-Length when the server doesn't know the size of the response body when it starts sending it.
Typically a server won't use both Content-Length and chunked encoding, but nothing actually stops it, so that's also something to consider. If you only need to interoperate with one specific server, you can simply observe what it does and work with that, but be aware that this makes your code less portable and more fragile in the face of future changes.
Note that when using these headers you'll still need to read in a loop, because any given read operation may return incomplete data; TCP's flow control stops the sender until the reading application has started to empty the buffer, so this isn't something you can work around. Also note that a single read may not even contain a complete chunk, so you need to track the size of the current chunk and how much of it you've already seen; you only know to read the next chunk header once you've seen the number of bytes the previous chunk header specified.
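To make the shape of that loop concrete, here is a minimal sketch for the Content-Length case (chunked decoding is left out for brevity; the helper names and the 4096-byte read size are my own choices):

def recv_headers(conn):
    # Read until the blank line that ends the headers; whatever follows
    # it is the first part of the body.
    buf = ''
    while '\r\n\r\n' not in buf:
        data = conn.recv(4096)
        if not data:
            raise IOError('connection closed before headers were complete')
        buf += data
    return buf.split('\r\n\r\n', 1)

def recv_body(conn, first_part, content_length):
    # Keep calling recv() until the whole body has arrived.
    parts = [first_part]
    remaining = content_length - len(first_part)
    while remaining > 0:
        data = conn.recv(min(remaining, 4096))
        if not data:
            raise IOError('connection closed %d bytes early' % remaining)
        parts.append(data)
        remaining -= len(data)
    return ''.join(parts)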
Of course, you don't have to worry about any of this if you use one of Python's many HTTP libraries. Speaking as someone who's had to implement a fairly complete HTTP/1.1 client before, you really want to let someone else do it if you possibly can; there are quite a few tricky corner cases to consider, and your simple code above is going to fail in a lot of them. If requests doesn't work for you, have you tried any of the standard Python libraries? urllib and urllib2 offer higher-level interfaces, and httplib provides a lower-level approach which you might find lets you work around some of your problems.
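For example, a short sketch using Python 2's httplib, which does this bookkeeping for you (host and path are placeholders; note that this era's HTTPSConnection does not verify server certificates by default):

import httplib
import json

conn = httplib.HTTPSConnection(host, 8443)
conn.request('GET', path)
resp = conn.getresponse()   # parses the status line and headers for you
body = resp.read()          # honors both Content-Length and chunked encoding
users = json.loads(body)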
Remember that you can always modify the code in these libraries (after copying it to your local repository, of course) if you really have to fix issues, or possibly just import them and monkey-patch your changes in. You'd have to be quite sure it was an issue in the library and not just a mistaken use of it, though.
If you really want to implement an HTTP client, that's fine, but just be aware that it's harder than it looks.
As a final aside, I've always used the read() method of SSL sockets instead of recv(). I'd hope they're equivalent, but you may wish to try that if you're still having issues.
As my previous question unfortunately got closed for being an "exact copy" of another question, while it definitely IS NOT, here it is again.
It is NOT a duplicate of Python: HTTP Post a large file with streaming
That one deals with streaming a big file; I want to send arbitrary chunks of a file one by one over the same HTTP connection. So I have a file of, say, 20 MB, and what I want to do is open an HTTP connection, send 1 MB, send another 1 MB, and so on, until it's complete. Using the same connection, so the server sees a 20 MB body appear over that connection.
mmap-ing a file is what I ALSO intend to do, but that does not work when the data is read from stdin, and it is primarily for that second case that I am looking for this part-by-part feeding of data.
Honestly, I wonder whether it can be done at all; if not, I'd like to know, and then I can close the issue. But if it can be done, how?
From the client’s perspective, it’s easy. You can use httplib’s low-level interface (putrequest, putheader, endheaders, and send) to send whatever you want to the server, in chunks of any size.
But you also need to indicate where your file ends.
If you know the total size of the file in advance, you can simply include the Content-Length header, and the server will stop reading your request body after that many bytes. The code may then look like this.
import httplib
import os.path

total_size = os.path.getsize('/path/to/file')
infile = open('/path/to/file', 'rb')   # binary mode, so the byte count matches

conn = httplib.HTTPConnection('example.org')
conn.connect()
conn.putrequest('POST', '/upload/')
conn.putheader('Content-Type', 'application/octet-stream')
conn.putheader('Content-Length', str(total_size))
conn.endheaders()

while True:
    chunk = infile.read(1024)
    if not chunk:
        break
    conn.send(chunk)

resp = conn.getresponse()
If you don’t know the total size in advance, the theoretical answer is chunked transfer encoding. The problem is that while it is widely used for responses, it seems less popular (although just as well defined) for requests, and stock HTTP servers may not handle it out of the box. But if the server is under your control too, you could try manually parsing the chunks from the request body and reassembling them into the original file.
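If you do go that route, producing the chunks yourself is straightforward. A sketch, under the assumption that the server really does accept chunked request bodies:

import httplib

conn = httplib.HTTPConnection('example.org')
conn.connect()
conn.putrequest('POST', '/upload/')
conn.putheader('Content-Type', 'application/octet-stream')
conn.putheader('Transfer-Encoding', 'chunked')
conn.endheaders()

while True:
    chunk = infile.read(1024)
    if not chunk:
        break
    # Each chunk is its size in hex, CRLF, the data itself, CRLF.
    conn.send('%x\r\n%s\r\n' % (len(chunk), chunk))
conn.send('0\r\n\r\n')   # a zero-length chunk terminates the body

resp = conn.getresponse()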
Another option is to send each chunk as a separate request (each with its own Content-Length) over the same connection. But you still need to implement custom logic on the server, and moreover you need to persist state between requests.
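A sketch of what that might look like; the X-Upload-Id and X-Chunk-Offset headers are made-up names, and the server would need matching reassembly logic:

offset = 0
while True:
    chunk = infile.read(1024 * 1024)
    if not chunk:
        break
    conn.putrequest('POST', '/upload/')
    conn.putheader('Content-Type', 'application/octet-stream')
    conn.putheader('Content-Length', str(len(chunk)))
    conn.putheader('X-Upload-Id', upload_id)       # hypothetical header
    conn.putheader('X-Chunk-Offset', str(offset))  # hypothetical header
    conn.endheaders()
    conn.send(chunk)
    conn.getresponse().read()   # drain each response before reusing the connection
    offset += len(chunk)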
Added 2012-12-27. There’s an nginx module that converts chunked requests into regular ones, which may be helpful as long as you don’t need true streaming (that is, starting to handle the request before the client has finished sending it).
I receive data from a device via the socket module.
But after some time the device stops sending packets.
Then I want to break out of the for-loop.
while True doesn't work, because it then receives more than 100 packets.
How can I stop this process?
s stands for the socket.
...
for i in range(packages100):
    data = s.recv(4)
    f.write(data)
...
Edit:
I think socket.settimeout() is part of the solution. See also:
How to set timeout on python's socket recv method?
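For example, a minimal sketch of that idea (the 5-second timeout is an arbitrary choice):

import socket

s.settimeout(5.0)       # give up if nothing arrives for 5 seconds
try:
    while True:
        data = s.recv(4)
        if not data:    # peer closed the connection
            break
        f.write(data)
except socket.timeout:
    pass                # the device went quiet; stop reading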
If your peer really just stops sending data, as opposed to closing the connection, this is tricky, and you'll have to resort to reading from the socket asynchronously.
Put the socket in asynchronous (non-blocking) mode (the docs and Google are your friends) and attempt a read each time instead of doing a blocking read; you can then simply stop trying whenever you wish. Note that by the nature of async I/O your code will look a bit different: you can no longer assume that once recv returns, it has actually read some data.
while 1:
    data = conn.recv(4)
    if not data: break
    f.write(data)
There is also a similar example in the Python docs.
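For completeness, one way the non-blocking variant might look (a sketch; exact error codes can differ across platforms):

import errno
import socket

conn.setblocking(False)
while True:
    try:
        data = conn.recv(4)
    except socket.error as e:
        if e.args[0] in (errno.EAGAIN, errno.EWOULDBLOCK):
            break       # nothing available right now; stop (or retry later)
        raise
    if not data:        # peer closed the connection
        break
    f.write(data)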
Should sockets be set to non-blocking when used with select.select in Python?
What difference does it make if they are or aren't?
Occasionally I find that calling send on a socket that select reports as writable will block. Furthermore, I find that blocking sockets will generally send the whole buffer given (128 KiB). In non-blocking mode, sending accepts far fewer bytes (20-40 KiB, compared with the example given earlier) and returns more quickly. I'm using Python 3.1 on Lucid.
Unfortunately, the answer might be OS-dependent; I'm replying only with regard to Linux.
I'm not aware of any difference between blocking and non-blocking sockets as far as select is concerned, but on Linux the select(2) man page has this in its BUGS section:
Under Linux, select() may report a socket file descriptor as "ready for reading", while nevertheless a subsequent read blocks. This could for example happen when data has arrived but upon examination has wrong checksum and is discarded. There may be other circumstances in which a file descriptor is spuriously reported as ready. Thus it may be safer to use O_NONBLOCK on sockets that should not block.
I doubt a Python abstraction above that could "hide" this issue without side effects.
As for the blocking write sending more data, that's expected: if the socket is blocking, send will block until there is enough buffer space to accept your whole request. If the socket is non-blocking, it only sends as much as currently fits in the socket's send buffer.
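A sketch of the non-blocking send pattern this implies (sock and data_to_send are placeholders):

import select

sock.setblocking(False)
buf = data_to_send      # the bytes you want to write

while buf:
    # Wait until the kernel reports room in the socket's send buffer.
    _, writable, _ = select.select([], [sock], [])
    if sock in writable:
        sent = sock.send(buf)   # may accept only part of the buffer
        buf = buf[sent:]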
I'm writing a custom FTP client to act as a gatekeeper for incoming multimedia content from subcontractors hired by one of our partners. I chose Twisted because it allows me to parse the file contents before writing the files to disk locally, and I've been looking for an occasion to explore Twisted anyway. I'm using twisted.protocols.ftp.FTPClient.retrieveFile to get the file, passing the escaped path to the file and a protocol to the retrieveFile method. I want to be absolutely sure that the entire file has been retrieved, because the event handler in the callback is going to write the file to disk locally and then delete the remote file from the FTP server, à la the '-E' switch behavior in the lftp client. My question is: do I really need to worry about this, or can I assume that an errback will happen if the file is not fully retrieved?
There are a couple of unit tests for behavior in this area.
twisted.test.test_ftp.FTPClientTestCase.test_failedRETR is the most directly relevant one. It covers the case where the control and data connections are lost while a file transfer is in progress.
It seems to me that test coverage in this area could be significantly improved. For example, there are no tests covering the case where just the data connection is lost while a transfer is in progress. One thing that makes this tricky, though, is that FTP is not a very robust protocol: the end of a file transfer is signaled by the data connection closing. To be safe, you have to check whether you received as many bytes as you expected. The only way to perform this check is to know the file size in advance or to ask the server for it using LIST (FTPClient.list).
Given all this, I'd suggest that when a file transfer completes, you always ask the server how many bytes you should have gotten and make sure it agrees with the number of bytes delivered to your protocol. You may sometimes get an errback on the Deferred returned from retrieveFile, but this will keep you safe even in the cases where you don't.
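A sketch of that size check, assuming the server's LIST output for the single path parses into one entry; FileReceiver, verify_size, and the ftpClient/remotePath variables are my own names:

from twisted.internet import protocol
from twisted.protocols.ftp import FTPFileListProtocol

class FileReceiver(protocol.Protocol):
    # Buffers retrieved bytes so they can be counted before writing to disk.
    def __init__(self):
        self.chunks = []

    def dataReceived(self, data):
        self.chunks.append(data)

def verify_size(ignored, ftp_client, path, receiver):
    # Ask the server how big the file should be, then compare byte counts.
    listing = FTPFileListProtocol()
    d = ftp_client.list(path, listing)

    def compare(ignored):
        expected = listing.files[0]['size']
        actual = sum(len(c) for c in receiver.chunks)
        if actual != expected:
            raise IOError('short transfer: %d of %d bytes' % (actual, expected))
        return receiver.chunks   # now safe to write locally and delete remotely

    return d.addCallback(compare)

receiver = FileReceiver()
d = ftpClient.retrieveFile(remotePath, receiver)
d.addCallback(verify_size, ftpClient, remotePath, receiver)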