Python requests reading response while uploading request body

I have a service that I need to upload content to, and the server starts sending the response once it has received a certain amount of data, while my request body is still uploading.
headers = {'Content-Type': 'application/octet-stream', 'Expect': '100-continue', 'Connection': 'keep-alive'}
url = "https://MY_API_URL/WEBSERVICE"
response = requests.put(url, headers=headers, stream=True, data=data_gen(fh))
lines = response.iter_lines()
for line in lines:
    print line
data_gen is my generator function; it takes a file handle to a very large file and yields 4 KB per iteration.
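For reference, a minimal data_gen along those lines, yielding 4 KB chunks until EOF, might look like:

def data_gen(fh):
    # Yield the file in 4 KB chunks until EOF.
    while True:
        chunk = fh.read(4096)
        if not chunk:
            break
        yield chunk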
My problem is that I don't get the response until the whole file has uploaded. Any ideas on how I can overcome this?

You cannot accomplish this with requests today. Requests (and the underlying libraries, including httplib/http.client, depending on your version of Python) sends all of the data before it starts reading the response.
One library that may be able to handle this (in fact, I'm fairly certain it should be doable) is treq. It uses Twisted, which should give you a way to determine when data is received, so all you should need to do is register a callback to start accessing that data.
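A rough, untested sketch (Python 3 syntax) of what that could look like with treq is below. The URL and file name are placeholders from the question, and whether the response really becomes available before the upload finishes depends on the server and on Twisted's handling of early responses:

from twisted.internet import reactor
import treq

def on_response(response):
    # Fires once the response headers arrive; treq.collect then invokes
    # the callback for each chunk of the response body as it is received.
    print(response.code)
    return treq.collect(response, lambda chunk: print(chunk))

fh = open('very_large_file.bin', 'rb')  # placeholder file name
d = treq.put('https://MY_API_URL/WEBSERVICE', data=fh)
d.addCallback(on_response)
d.addBoth(lambda _: reactor.stop())
reactor.run()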

Related

How to check HTTP status of a file online without fully downloading the file?

I have a database of thousands of files online, and I want to check what their status is (e.g. if the file exists, if it sends us to a 404, etc.) and update this in my database.
I've used urllib.request to download files in a Python script. However, downloading terabytes of files is obviously going to take a long time. Parallelizing the process would help, but ultimately I just don't want to download all the data; I only want to check the status. Is there an ideal way to check (using urllib or another package) the HTTP response code of a given URL?
Additionally, if I can get the file size from the server (which would be in the HTTP response), then I can also update this in my database.
If your web server is standards-based, you can use a HEAD request instead of a GET. It returns the same status without actually fetching the page.
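A short sketch of that with requests (the URL is a placeholder): HEAD transfers only the status line and headers, and the Content-Length header, when the server supplies it, gives the file size the question also asks about.

import requests

response = requests.head('https://example.com/somefile.zip', allow_redirects=True)
print(response.status_code)
print(response.headers.get('Content-Length'))  # size in bytes, if the server sends it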
The requests module can check the status code of a response.
Just do:
import requests
url = 'https://www.google.com' # Change to your link
response = requests.get(url)
print(response.status_code)
This prints 200, so the request was successful.

GET request with files argument in requests library Python

I found this code and I really don't understand it. How is it possible to send data (not a query string) with a GET request?
response = requests.get(
    check_all_info_url_2, files=multipart_form_data, timeout=30)
And what is the files= argument in the GET request?
Since requests.get is just a wrapper function, this will just call requests.request. Unless requests.Session implements any checking, it will happily send off a GET request with multipart data in it.
Is this valid? Not to my knowledge, although I'm willing to be proven wrong. No API I have ever written would accept a file upload on a GET request. But not every server will even check the method, so perhaps this code is interacting with a badly written server that doesn't reject the wrong method, or perhaps it's even interacting with a worse server that expects a file upload with GET. There are lots of broken servers out there ;)
In any case, the reason this works with requests is that it just passes keyword arguments through to the underlying session without performing any kind of validation.
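You can see this for yourself without hitting a real server by preparing the request and inspecting it; the payload below is a made-up example:

import requests

multipart_form_data = {'upload': ('report.txt', b'some data')}  # made-up payload
req = requests.Request('GET', 'https://httpbin.org/anything', files=multipart_form_data)
prepared = req.prepare()
print(prepared.method)                   # GET
print(prepared.headers['Content-Type'])  # multipart/form-data; boundary=...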

Sort header from HTTP response before writing it to a file

I am currently trying to implement an HTTP client using Python and sockets. It is very simple; the only thing it has to do is download a file from a web server and put it into a file supplied by the user.
My code is working fine, but I'm having trouble figuring out how to exclude the HTTP response header from the file.
The HTTP response header only appears at the beginning of the stream, so I was thinking I could just dump all the data into the file and then take the header out afterwards. That's a problem, though, since I/O is very slow.
My next thought was that I could run a regex on the first response I get from the server, strip away the header, and then dump the rest into the file. That seems like a very clunky way to do it, though.
Does anyone have any suggestions on how to do this in a smart way?
In the HTTP response, the headers are separated from the body by '\r\n\r\n'. To get only the body, you can try this:
bodyBegin = httpResponse.find('\r\n\r\n') + 4
body = httpResponse[bodyBegin:]
saveToFile(body)
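Applied directly to the socket, the same idea looks roughly like this. It's a sketch, not a full client: it speaks HTTP/1.0 so the body arrives unchunked, and the host, path, and file names are placeholders. It buffers only until the blank line that ends the headers, then streams the rest of the body straight into the file.

import socket

def download(host, path, out_path):
    sock = socket.create_connection((host, 80))
    sock.sendall('GET {0} HTTP/1.0\r\nHost: {1}\r\n\r\n'.format(path, host).encode())

    # Read until the blank line that terminates the headers.
    buf = b''
    while b'\r\n\r\n' not in buf:
        chunk = sock.recv(4096)
        if not chunk:
            break
        buf += chunk

    body_begin = buf.find(b'\r\n\r\n') + 4
    with open(out_path, 'wb') as f:
        f.write(buf[body_begin:])  # whatever body bytes arrived with the headers
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            f.write(chunk)
    sock.close()

download('example.com', '/index.html', 'output.html')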

How to get size of transmission using urllib2?

I have this basic code (from https://docs.python.org/2/howto/urllib2.html):
import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
I would like to get the size of the entire request and the size of the entire response. Is there any way?
(I haven't found one for urllib2 or for requests.)
"Entire" means including headers and any metadata that might be sent with it.
Thanks.
res.headers may or may not contain a Content-Length field provided by the server. If it's there, int(res.headers['content-length']) will give you the size of the body, but only if the server provides it.
A very simple implementation of an HTTP stream might not provide this information at all, so you don't know the size until you reach EOF.
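To get closer to the "entire" sizes the question asks about, you can add up the body and the raw response header block yourself. A rough Python 2 sketch (the request side isn't fully visible this way, since urllib2 adds some headers internally, so treat the numbers as approximations):

import urllib2

req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
body = response.read()

# response.info() exposes the raw header lines the server sent.
response_headers = ''.join(response.info().headers)

print('response body: %d bytes' % len(body))
print('response headers: %d bytes' % len(response_headers))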

Checking if a file is being downloaded by Python Requests library

I have been having problems with a script I am developing: I receive no output, and the memory usage of the script gets larger and larger over time. I have figured out that the problem lies with some of the URLs I am checking with the Requests library. I expect to download a webpage, but I download a large file instead. All this data is then stored in memory, causing my issues.
What I want to know is: is there any way with the requests library to check what is being downloaded? With wget I can see: Length: 710330974 (677M) [application/zip].
Is this information available in the headers with requests? If so, is there a way of terminating the download upon discovering it is not an HTML webpage?
Thanks in advance.
Yes, the headers can tell you a lot about the page; most pages will include a Content-Length header.
By default, however, the response is downloaded in its entirety before the .get() or .post(), etc. call returns. Set the stream=True keyword argument to defer loading the response body:
response = requests.get(url, stream=True)
Now you can inspect the headers and just discard the response if you don't like what you find:
length = int(response.headers.get('Content-Length', 0))
if length > 1048576:
    print 'Response larger than 1MB, discarding'
    response.close()  # drop the connection without reading the body
Subsequently accessing the .content or .text attributes, or the .json() method will trigger a full download of the response.
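Putting both of the question's checks together might look like this; the 1 MB cutoff is an arbitrary example value, and the URL is a placeholder:

import requests

url = 'https://example.com/large-file.zip'  # placeholder
response = requests.get(url, stream=True)
content_type = response.headers.get('Content-Type', '')
length = int(response.headers.get('Content-Length', 0))

if 'text/html' in content_type and length < 1048576:
    html = response.text  # triggers the actual body download
else:
    response.close()      # abandon the response without downloading it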
