I need to get json data and I'm using urllib2:
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'gzip')
opener = urllib2.build_opener()
connection = opener.open(request)
data = connection.read()
but although the data isn't very large, the request is too slow.
Is there a way to speed it up? I can use 3rd party libraries too.
Accept-Encoding: gzip means that the client is ready to accept gzip-encoded content if the server is willing to send it. The rest of the request goes down the socket, through your operating system's TCP/IP stack, and then to the physical layer.
If the server supports ETags, you can send an If-None-Match header to check whether the content has changed and rely on your cache if it has not. An example is given here.
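A minimal sketch of the ETag idea, assuming Python 3's urllib.request rather than urllib2. The names fetch_cached, cache and open_func are made up for illustration, and the cache is just an in-memory dict. urlopen reports a 304 Not Modified reply as an HTTPError, so that case is handled in the except branch:

```python
import urllib.error
import urllib.request


def fetch_cached(url, cache, open_func=urllib.request.urlopen):
    """Return the body for url, reusing the cached copy on 304 Not Modified.

    cache is a plain dict mapping url -> (etag, body); it is a hypothetical
    stand-in for whatever storage your application uses.
    """
    etag, cached_body = cache.get(url, (None, None))
    request = urllib.request.Request(url)
    if etag is not None:
        # Ask the server to send the body only if it differs from our copy.
        request.add_header("If-None-Match", etag)
    try:
        response = open_func(request)
    except urllib.error.HTTPError as err:
        if err.code == 304 and cached_body is not None:
            return cached_body
        raise
    body = response.read()
    new_etag = response.headers.get("ETag")
    if new_etag:
        cache[url] = (new_etag, body)
    return body
```

This only saves time when the server actually returns an ETag header and honours If-None-Match; otherwise every call downloads the full body again.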
You cannot do much on the client side alone to improve your HTTP request speed.
You're dependent on a number of different things here that may not be within your control:
Latency/Bandwidth of your connection
Latency/Bandwidth of server connection
Load of server application and its individual processes
Items 2 and 3 are probably where the problem lies, and you won't be able to do much about them. Is the content cacheable? This will depend on your own application's needs and on the HTTP headers (e.g. ETag, Cache-Control, Last-Modified) that are returned by the server. The server may only update once a day, in which case you might be better off requesting the data only every hour.
It is unlikely that urllib is the problem. If you have network issues and performance problems, consider using tools like Wireshark to investigate at the network level. I have very strong doubts that this is related to Python in any way.
If you are making lots of requests, look into threading. Having about 10 workers making requests can speed things up - you don't grind to a halt if one of them takes too long getting a connection.
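A rough sketch of the worker idea, assuming Python 3 and its concurrent.futures module. The function names and the default of 10 workers are illustrative choices, not anything the original code prescribes:

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=10):
    """Download one URL and return its raw body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, workers=10):
    """Fetch many URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() runs fetch_one in the worker threads and preserves ordering.
        return list(pool.map(fetch_one, urls))
```

Because each worker blocks on its own connection, one slow server delays only that worker and not the whole batch.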
The problem: I need to send many HTTP requests to a server. I can only use one connection (non-negotiable server limit). The server's response time plus the network latency is too high – I'm falling behind.
The requests typically don't change server state and don't depend on the previous request's response. So my idea is to simply send them on top of each other, enqueue the response objects, and depend on the Content-Length: of the incoming responses to feed incoming replies to the next-waiting response object. In other words: Pipeline the requests to the server.
This is of course not entirely safe (any reply without Content-Length: means trouble), but I don't care -- in that case I can always retry any queued requests. (The safe way would be to wait for the header before sending the next bit. That might help me enough. No way to test beforehand.)
So, ideally I want the following client code (which uses client delays to mimic network latency) to run in three seconds.
Now for the $64000 question: Is there a Python library which already does this, or do I need to roll my own? My code uses gevent; I could use Twisted if necessary, but Twisted's standard connection pool does not support pipelined requests. I also could write a wrapper for some C library if necessary, but I'd prefer native code.
#!/usr/bin/python
import gevent.pool
from gevent import sleep
from time import time
from geventhttpclient import HTTPClient

url = 'http://local_server/100k_of_lorem_ipsum.txt'
http = HTTPClient.from_url(url, concurrency=1)

def get_it(http):
    print time(),"Queueing request"
    response = http.get(url)
    print time(),"Expect header data"
    # Do something with the header, just to make sure that it has arrived
    # (the greenlet should block until then)
    assert response.status_code == 200
    # Header values are strings, so convert before comparing
    assert int(response["content-length"]) > 0
    for h in response.items():
        pass
    print time(),"Wait before reading body data"
    # Now I can read the body. The library should send at
    # least one new HTTP request during this time.
    sleep(2)
    print time(),"Reading body data"
    while response.read(10000):
        pass
    print time(),"Processing my response"
    # The next request should definitely be transmitted NOW.
    sleep(1)
    print time(),"Done"

# Run parallel requests
pool = gevent.pool.Pool(3)
for i in range(3):
    pool.spawn(get_it, http)
pool.join()
http.close()
Dugong is an HTTP/1.1-only client which claims to support real HTTP/1.1 pipelining. The tutorial includes several examples on how to use it, including one using threads and another using asyncio.
Be sure to verify that the server you're communicating with actually supports HTTP/1.1 pipelining—some servers claim to support HTTP/1.1 but don't implement pipelining.
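If you end up rolling your own, the core of pipelining is small. The following is only a sketch in Python 3 with hypothetical helper names (build_pipelined_requests, split_responses): it writes several requests back to back and then uses Content-Length to find where each reply ends. It deliberately refuses replies without Content-Length, which is the unsafe case mentioned in the question:

```python
def build_pipelined_requests(host, paths):
    """Concatenate several GET requests so they can be sent in one write."""
    requests = []
    for path in paths:
        text = "GET {0} HTTP/1.1\r\nHost: {1}\r\n\r\n".format(path, host)
        requests.append(text.encode("ascii"))
    return b"".join(requests)


def split_responses(buffer, expected):
    """Split back-to-back responses into (status_line, headers, body) tuples."""
    responses = []
    while len(responses) < expected:
        head, separator, rest = buffer.partition(b"\r\n\r\n")
        if not separator:
            raise ValueError("incomplete header block")
        lines = head.decode("iso-8859-1").split("\r\n")
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
        if "content-length" not in headers:
            raise ValueError("no Content-Length, cannot find the boundary")
        length = int(headers["content-length"])
        if len(rest) < length:
            raise ValueError("incomplete body")
        responses.append((lines[0], headers, rest[:length]))
        # Whatever follows this body belongs to the next response.
        buffer = rest[length:]
    return responses
```

In a real client the request bytes would go out with a single sendall() on the one permitted connection, and the replies would be accumulated with repeated recv() calls before (or while) being split.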
I think txrequests could get you most of what you are looking for, using the background_callback to enqueue processing of responses on a separate thread. Each request would still be its own thread, but using a session means that by default it would reuse the same connection.
https://github.com/tardyp/txrequests#working-in-the-background
It seems you are running Python 2.
On Python 3.5 or later you could use async/await with an event loop; see asyncio.
There is also Trio, a separate async library designed to be easier to use, available via pip.
Another thing I can think of is multiple threads with locks, although I am not yet sure whether that could work here or how best to explain it.
I am writing a Python language plugin for an active code generator that makes calls to our REST API. After making many attempts to use the requests library and failing, I opted to use the much lower-level socket and ssl modules, which have been working fine so far. I am using a very crude method to parse the responses; for fairly short response bodies this works fine, but I am now trying to retrieve much larger JSON objects (lists of users). The response is being cut off as follows (note: I removed a couple of user entries for the sake of brevity):
{"page-start":1,"total":5,"userlist":[{"userid":"jim.morrison","first-name":"Jim","last-name":"Morrison","language":"English","timezone":"(GMT+5:30)CHENNAI,KOLKATA,MUMBAI,NEW DELHI","currency":"US DOLLAR","roles":
There should be a few more users after this and the response body is on a single line in the console.
Here is the code I am using to request the user list from the Rest API server:
import socket, ssl, json
host = self.WrmlClientSession.api_host
port = 8443
pem_file = "<pem file>"
url = self.WrmlClientSession.buildURI(host, port, '<root path>')
#Create the header
http_header = 'GET {0} HTTP/1.1\n\n'
req = http_header.format(url)
#Socket configuration and connection execution
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
conn = ssl.wrap_socket(sock, ca_certs = pem_file)
conn.connect((host, port))
conn.send(req)
response = conn.recv()
(headers, body) = response.split("\r\n\r\n")
#Here I would convert the body into a json object, but because the response is
#cut off, it cannot be properly decoded.
print(response)
Any insight into this matter would be greatly appreciated!
Edit: I forgot to mention that I debugged the response on the server-side, and everything was perfectly normal.
You can't assume that you can just call recv() once and get all the data since the TCP connection will only buffer a limited amount. Also, you're not parsing any of the headers to determine the size of body that you're expecting. You could use a non-blocking socket and keep reading until it blocks, which will mostly work but is simply not reliable and quite poor practice so I'm not going to document it here.
HTTP has ways of indicating the size of the body for exactly this reason, and the correct approach is to use them if you want your code to be reliable. There are two things to look for. First, if the HTTP response has a Content-Length header, it indicates how many bytes the response body contains, and you need to keep reading until you have received that many. Second, the server may send a response that uses chunked encoding, which it indicates by including a Transfer-Encoding header whose value contains the text chunked. I won't go into chunked encoding in detail here; read the Wikipedia article for that. In essence, the body contains a small header for each "chunk" of data giving the size of that chunk, and you keep reading chunks until you get an empty one, which marks the end of the response. This approach is used instead of Content-Length when the server doesn't know the size of the response body when it starts to send it.
Typically a server won't use both Content-Length and chunked encoding, but there's nothing to actually stop it, so that's also something to consider. If you only need to interoperate with a specific server, you can check what it does and work with that, but be aware that this makes your code less portable and more fragile in the face of future changes.
Note that even with these headers you'll still need to read in a loop, because any given read operation may return incomplete data: TCP delivers data as it arrives, and the sender pauses when the receive buffer is full until the reading application starts to empty it, so this isn't something you can work around. Also note that a single read may not even contain a complete chunk, so you need to keep track of the size of the current chunk and how much of it you've already seen. You only know to read the next chunk header when you've seen the number of bytes specified by the previous chunk header.
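To make the read loop concrete, here is a hedged sketch in Python 3. The helper names are invented for illustration, and recv is any callable that behaves like a socket's recv(bufsize). It handles Content-Length and chunked bodies but ignores trailers and the many other corner cases a real client must cover:

```python
def read_until(recv, buffer, marker):
    """Read until marker appears; return (data before marker, leftover)."""
    while marker not in buffer:
        data = recv(4096)
        if not data:
            raise EOFError("connection closed before marker was seen")
        buffer += data
    head, _, rest = buffer.partition(marker)
    return head, rest


def read_exact(recv, buffer, count):
    """Read until at least count bytes are buffered; return (data, leftover)."""
    while len(buffer) < count:
        data = recv(4096)
        if not data:
            raise EOFError("connection closed before body was complete")
        buffer += data
    return buffer[:count], buffer[count:]


def read_response(recv):
    """Return (status_line, headers, body) for one HTTP/1.1 response."""
    head, rest = read_until(recv, b"", b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    if "chunked" in headers.get("transfer-encoding", "").lower():
        body = b""
        while True:
            size_line, rest = read_until(recv, rest, b"\r\n")
            # The chunk size is hexadecimal; extensions after ";" are ignored.
            size = int(size_line.split(b";")[0].strip().decode("ascii"), 16)
            if size == 0:
                break  # an empty chunk marks the end of the body
            chunk, rest = read_exact(recv, rest, size + 2)  # data plus CRLF
            body += chunk[:size]
    else:
        length = int(headers.get("content-length", 0))
        body, rest = read_exact(recv, rest, length)
    return lines[0], headers, body
```

With the code from the question this would be called as read_response(conn.recv), and the returned body could then be passed to json.loads.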
Of course, you don't have to worry about any of this if you use any of Python's many HTTP libraries. Speaking as someone who's had to implement a fairly complete HTTP/1.1 client before, you really want to let someone else do it if you possibly can: there are quite a few tricky corner cases to consider, and your simple code above is going to fail in a lot of them. If requests doesn't work for you, have you tried any of the standard Python libraries? There are urllib and urllib2 for higher-level interfaces, and httplib provides a lower-level approach which you might find allows you to work around some of your problems.
Remember that you can always modify the code in these (after copying to your local repository of course) if you really have to fix issues, or possibly just import them and monkey-patch your changes in. You'd have to be quite clear it was an issue in the library and not just a mistaken use of it, though.
If you really want to implement an HTTP client that's fine, but just be aware that it's harder than it looks.
As a final aside, I've always used the read() method of SSL sockets instead of recv() - I'd hope they'd be equivalent, but you may wish to try that if you're still having issues.
Unfortunately my previous question got closed as an "exact copy" of another question, which it definitely IS NOT, so I am asking again.
It is NOT a duplicate of Python: HTTP Post a large file with streaming
That one deals with streaming a big file; I want to send arbitrary chunks of a file one by one to the same http connection. So I have a file of say 20 MB, and what I want to do is open an HTTP connection, then send 1 MB, send another 1 MB, etc, until it's complete. Using the same connection, so the server sees a 20 MB chunk appear over that connection.
Mmapping a file is what I ALSO intend to do, but that does not work when the data is read from stdin. And primarily for that second case I am looking for this part-by-part feeding of data.
Honestly I wonder whether it can be done at all. If not, I'd like to know, so I can close the issue. But if it can be done, how could it be done?
From the client’s perspective, it’s easy. You can use httplib’s low-level interface—putrequest, putheader, endheaders, and send—to send whatever you want to the server in chunks of any size.
But you also need to indicate where your file ends.
If you know the total size of the file in advance, you can simply include the Content-Length header, and the server will stop reading your request body after that many bytes. The code may then look like this.
import httplib
import os.path

total_size = os.path.getsize('/path/to/file')
# Open in binary mode so the bytes sent match Content-Length
infile = open('/path/to/file', 'rb')
conn = httplib.HTTPConnection('example.org')
conn.connect()
conn.putrequest('POST', '/upload/')
conn.putheader('Content-Type', 'application/octet-stream')
conn.putheader('Content-Length', str(total_size))
conn.endheaders()
# Send the file piece by piece over the same connection
while True:
    chunk = infile.read(1024)
    if not chunk:
        break
    conn.send(chunk)
infile.close()
resp = conn.getresponse()
If you don’t know the total size in advance, the theoretical answer is the chunked transfer encoding. Problem is, while it is widely used for responses, it seems less popular (although just as well defined) for requests. Stock HTTP servers may not be able to handle it out of the box. But if the server is under your control too, you could try manually parsing the chunks from the request body and reassembling them into the original file.
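For the unknown-size case, a sketch of what sending a chunked request body could look like. It is written for Python 3, where httplib is named http.client and has the same putrequest, putheader, endheaders and send methods. The helper names upload_stream and encode_chunk are made up here, and whether the receiving server accepts chunked requests is an assumption you would have to verify:

```python
def encode_chunk(data):
    """Frame one piece of data as a chunk: hex size, CRLF, data, CRLF."""
    return format(len(data), "x").encode("ascii") + b"\r\n" + data + b"\r\n"


# A zero-length chunk followed by an empty line ends the body.
LAST_CHUNK = b"0\r\n\r\n"


def upload_stream(conn, path, stream, block_size=1024 * 1024):
    """POST everything readable from stream without knowing its size."""
    conn.putrequest("POST", path)
    conn.putheader("Content-Type", "application/octet-stream")
    conn.putheader("Transfer-Encoding", "chunked")
    conn.endheaders()
    while True:
        block = stream.read(block_size)
        if not block:
            break
        conn.send(encode_chunk(block))
    conn.send(LAST_CHUNK)
    return conn.getresponse()
```

To feed it from stdin, one would pass an http.client.HTTPConnection for the target host as conn and sys.stdin.buffer as stream.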
Another option is to send each chunk as a separate request (with Content-Length) over the same connection. But you still need to implement custom logic on the server. Moreover, you need to persist state between requests.
Added 2012-12-27. There’s an nginx module that converts chunked requests into regular ones. May be helpful so long as you don’t need true streaming (start handling the request before the client is done sending it).
EDIT:Question Updated. Thanks Slott.
I have a TCP Server in Python.
It is a server with asynchronous behaviour.
The message format is binary data.
Currently I have a Python client that interacts with the server.
What I eventually want to do is implement a web-based front end to this client.
I just wanted to know what the correct design for such an application would be.
Start with any WSGI-based web server. werkzeug is a choice.
Asynchronous TCP/IP is a seriously complicated problem. HTTP is synchronous. So using a synchronous web server to present asynchronous data is always a problem. Always.
The best you can do is to buffer things and have two processes in your web application.
TCP/IP process that collects data from the remote server and buffers it in a file (or files) somewhere.
WSGI web process which handles GET/POST processing.
GET requests will fetch some or all of the buffer and display it.
POST requests will send a message to the TCP/IP server.
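The web half of that design could look roughly like the following Python 3 sketch. As a simplification it assumes the TCP side runs as a thread in the same process and shares an in-memory buffer and queue, instead of the separate process and file buffer described above. All names are illustrative:

```python
import collections
import json
import queue
import threading

# Filled by the (not shown) TCP reader, read by GET requests.
received = collections.deque(maxlen=1000)
buffer_lock = threading.Lock()
# Filled by POST requests, drained by the (not shown) TCP writer.
outgoing = queue.Queue()


def application(environ, start_response):
    """WSGI app: GET shows buffered messages, POST queues one for sending."""
    method = environ["REQUEST_METHOD"]
    if method == "GET":
        with buffer_lock:
            messages = [message.hex() for message in received]
        body = json.dumps(messages).encode("utf-8")
        start_response("200 OK", [("Content-Type", "application/json"),
                                  ("Content-Length", str(len(body)))])
        return [body]
    if method == "POST":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        outgoing.put(environ["wsgi.input"].read(length))
        start_response("202 Accepted", [("Content-Length", "0")])
        return [b""]
    start_response("405 Method Not Allowed", [("Allow", "GET, POST"),
                                              ("Content-Length", "0")])
    return [b""]
```

The binary messages are hex-encoded only so that they can be shown as JSON; a real front end would decode them into something meaningful first.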
For Web-based, talk HTTP. Use JSON or XML as data formats.
Be standards-compliant and make use of the vast number of libraries out there. Don't reinvent the wheel. This way you have fewer headaches in the long run.
If you need to maintain a connection to a backend server across multiple HTTP requests, Twisted's HTTP server is an ideal choice, since it's built to manage multiple connections easily.
I am coding a python (2.6) interface to a web service. I need to communicate via http so that :
Cookies are handled automatically,
The requests are asynchronous,
The order in which the requests are sent is respected (the order in which the responses to these requests are received does not matter).
I have tried what could be easily derived from the built-in libraries, facing different problems:
Using httplib and urllib2, the requests are synchronous unless I use threads, in which case the order is not guaranteed to be respected,
Using asyncore, there was no library to automatically deal with cookies sent by the web service.
After some googling, it seems that there are many examples of Python scripts or libraries that match 2 out of the 3 criteria, but not all 3 of them. I am thinking of reading through the cookielib sources and adapting what I need of it to asyncore (or only to my application in an ad hoc manner), but it seems strange that nothing like this exists yet, as I guess I am not the only one interested. If anyone has pointers about this problem, they would be greatly appreciated.
Thank you.
Edit to clarify :
What I am doing is a local proxy that interfaces my IRC client with a webchat. It creates a socket that listens to IRC connections, then upon receiving one, it logs in the webchat via http. I don't have access to the behaviour of the webchat, and it uses cookies for session IDs. When client sends several IRC requests to my python proxy, I have to forward them to the webchat's server via http and with cookies. I also want to do this asynchronously (I don't want to wait for the http response before I send the next request), and currently what happens is that the order in which the http requests are sent is not the order in which the IRC commands were received.
I hope this clarifies the question, and I will of course detail more if it doesn't.
"Using httplib and urllib2, the requests are synchronous unless I use threads, in which case the order is not guaranteed to be respected"
How would you know that the order has been respected unless you get the response to the first request back before you send the second request? After all, you don't care what order the responses come in, so it's entirely possible that the responses come back in the order you expect even though your requests were processed in the wrong order!
The only way you can guarantee the ordering is by waiting for confirmation that the first request has successfully arrived (eg. you start receiving the response for it) before beginning the second request. You can do this by not launching the second thread until you reach the response handling part of the first thread.
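One way to get that guarantee without blocking the caller is a single worker thread fed from a FIFO queue. The sketch below is Python 3 rather than the 2.6 of the question, and OrderedSender is a made-up name. It is stricter than described above, since it waits for each complete response before sending the next request:

```python
import queue
import threading


class OrderedSender:
    """Send requests one at a time, in submission order, off the main thread."""

    def __init__(self, send):
        self._send = send  # callable that performs one blocking request
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, request, callback=None):
        """Queue a request and return immediately."""
        self._queue.put((request, callback))

    def _run(self):
        while True:
            item = self._queue.get()
            if item is None:
                break  # sentinel placed by close()
            request, callback = item
            try:
                result = self._send(request)
            except Exception as err:
                result = err  # hand failures to the callback as well
            try:
                if callback is not None:
                    callback(result)
            finally:
                self._queue.task_done()

    def close(self):
        """Wait for queued requests to finish, then stop the worker."""
        self._queue.join()
        self._queue.put(None)
        self._worker.join()
```

For cookie handling, send could be the open method of an opener created with urllib.request.build_opener(urllib.request.HTTPCookieProcessor()), so that every request in the queue shares one cookie jar.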