I have the basic code (from https://docs.python.org/2/howto/urllib2.html):
import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
I would like to get the size of the entire request and the size of the entire response. Is there any way?
(I haven't seen one for urllib2 or for requests.)
"Entire" means including headers and any metadata that might be sent with it.
Thanks.
res.headers may or may not contain a Content-Length field provided by the server. So int(res.headers['content-length']) will give you the size of the response body - if the server provides it.
A very simple implementation of an HTTP stream might not provide this information at all, so you don't know the size until you hit EOF.
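As a rough sketch based on the code in the question (Python 2 / urllib2): the body size is just the length of what you read, and the header block can be approximated from the raw header lines in response.info(). Content-Length may simply be absent:
import urllib2

req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)

header_bytes = sum(len(line) for line in response.info().headers)  # raw header lines
body = response.read()
reported = response.info().getheader('Content-Length')  # None if the server omits it

print 'Header bytes (approx.):', header_bytes
print 'Body bytes read:       ', len(body)
print 'Content-Length header: ', reported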
I want to get audio streaming data from a server using Python.
I tried a simple request to the audio stream URL using urllib:
req = urllib.request.Request(<url>)
but I get an exception:
http.client.BadStatusLine: Uª¨Ì5¦`
It looks like the server responds and sends data without any HTTP headers, not even a status line.
Is there any way to get and process the response in this case?
It is also worth mentioning the results I got requesting this URL with other clients:
Curl:
curl "http://<server>:81/audiostream.cgi?user=<user>&pwd=<password>&streamid=0&filename=" curl: (1) Received HTTP/0.9 when not allowed
The workaround is use --http0.9 switch.
Chrome/Chromium-based browsers show:
ERR_INVALID_HTTP_RESPONSE
Mozilla Firefox can correctly fetch this data as binary
Screenshot
Can you post the code fragment? Or maybe you just need to search Google and SO first. I have found several links that mention this problem.
Like:
Why am I getting httplib.BadStatusLine in python?
BadStatusLine exception raised when returning reply from server in Python 3
Why does this url raise BadStatusLine with httplib2 and urllib2?
Issue 42432: Http client, Bad Status Line triggered for no reason
Check again and think twice! Search SO before starting a new thread.
HTTP 0.9 is about the simplest possible HTTP protocol:
The client sends a document request consisting of a line of ASCII characters terminated by a CR LF (carriage return, line feed) pair [...]
This request consists of the word "GET", a space, the document address, omitting the "http:, host and port parts when they are the coordinates just used to make the connection.
The response to a simple GET request is a message in hypertext mark-up language ( HTML ). This is a byte stream of ASCII characters.
source
Thus your server is not sending a valid HTTP 0.9 response, as it's not HTML. Chrome (etc.) is quite within its rights to reject it, although in practice it may not even support HTTP 0.9.
In this case the camera is apparently (ab)using HTTP to start a stream (presumably it will carry on sending data over the connection, which is also not HTTP 0.9, although not explicitly forbidden). The simplest way to get the data you want is to do it manually:
Create and open a socket to the server's base address.
Send a GET request for audiostream.cgi?user=<user>&pwd=<password>&streamid=0&filename= (do you really need that last param?).
Run socket.recv(max_bytes) in a loop in a thread and transfer the data to a (thread-safe) buffer; do whatever you want with that buffer in another thread (see the sketch below).
Alternatively if you're familiar with async programming, use asyncio rather than threads.
You will obviously need to handle decoding the file stream yourself. Hopefully you can identify the format and pass it to a decoder; alternatively something like ffmpeg might be able to guess it.
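Putting those steps together, a rough, untested sketch (the host, port, user and password below are placeholders; like curl's --http0.9 workaround, it sends an ordinary request line and simply treats whatever comes back as raw bytes):
import socket
import threading
import queue

HOST, PORT = 'camera.example.com', 81   # placeholders
PATH = '/audiostream.cgi?user=USER&pwd=PASSWORD&streamid=0&filename='

buf = queue.Queue()   # thread-safe buffer shared with the consumer

def reader():
    sock = socket.create_connection((HOST, PORT))
    sock.sendall('GET {} HTTP/1.0\r\n\r\n'.format(PATH).encode('ascii'))
    while True:
        chunk = sock.recv(4096)
        if not chunk:          # server closed the connection
            break
        buf.put(chunk)
    buf.put(None)              # sentinel marking end of stream

threading.Thread(target=reader, daemon=True).start()

while True:
    chunk = buf.get()
    if chunk is None:
        break
    # hand `chunk` to your audio decoder here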
Have you tried including a User-Agent header when making this request? Sometimes this kind of failure is caused by web-scraping detection.
import urllib2
opener = urllib2.build_opener()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1',
}
opener.addheaders = headers.items()
response = opener.open(<url>)
I am scraping a variety of pages (the_url) within a large website using the following code:
opener = urllib.request.build_opener()
url = opener.open(the_url)
contents_of_webpage = url.read()
url.close()
contents_of_webpage = contents_of_webpage.decode("utf-8")
This works fine for almost every page but occasionally I get:
urllib.error.HTTPError: HTTP Error 413: Payload Too Large
Looking for solutions, I come up against answers of the form "well, a web server may choose to give this as a response", as if there were nothing to be done. But all of my browsers can read the page without problems, and presumably they are making the same kind of request, so surely some kind of solution exists. For example, can you ask for a web page a little bit at a time to avoid a large payload?
It depends heavily on the site and the URL you're requesting. To avoid your problem, most sites/APIs offer pagination on their endpoints. Check whether the endpoint you're requesting accepts GET parameters like ?offset=<int>&limit=<int> or something similar.
UPD: besides that, urllib is not very good at emulating browser behavior.
So you could try making the same request using requests, or setting the same User-Agent header your browser sends.
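For the second suggestion, a minimal sketch (the User-Agent string below is just an example; copy the one your own browser actually sends):
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/96.0 Safari/537.36',
}
response = requests.get(the_url, headers=headers)
response.raise_for_status()          # raises on 413 and other HTTP errors
contents_of_webpage = response.text  # decoded using the charset the server reports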
I am a beginner so I apologize if my question is very obvious or not worded correctly.
I need to send a request to a URL so data can then be sent back in XML format. The URL will have a user-specific login and password, so I need to incorporate that as well. Also there is a port (port 80) that I need to include in the request. Is requests.get the way to go? I'm not exactly sure where to start. After receiving the XML data, I need to process it (store it) on my machine - if anyone also wants to take a stab at that (I am also struggling to understand exactly how XML data is sent over - is it an entire file?). Thanks in advance for the help.
Here is a python documentation on how to fetch internet resources using the urllib package.
It talks about getting the data, storing it in a file, sending data and some basic authentication.
https://docs.python.org/3/howto/urllib2.html
Getting the URL would look something like this:
import urllib.request
urllib.request.urlopen("http://yoururlhere.co.uk").read()
Note that this is Python 3 only, and that read() returns the raw bytes of the page.
The Python 2 version can be found here:
What is the quickest way to HTTP GET in Python?
If you want to parse the data, you may want to use this:
https://docs.python.org/2/library/xml.etree.elementtree.html
I hope this helps! I am not too sure how you would approach the username and password part, but these links should hopefully give you the information you need for some of the other parts.
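If the service happens to use HTTP basic authentication, a rough sketch with requests and ElementTree might look like this (the URL and credentials are placeholders; port 80 is the default for http:// but can be written out explicitly):
import requests
import xml.etree.ElementTree as ET

url = 'http://example.com:80/endpoint'   # placeholder URL
response = requests.get(url, auth=('my_username', 'my_password'))
response.raise_for_status()

# The XML arrives as the body of the response, i.e. one text document.
with open('response.xml', 'w', encoding='utf-8') as f:
    f.write(response.text)

root = ET.fromstring(response.content)   # parse it if you need the values
print(root.tag)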
Import the requests library and then call the post method as follows:
import requests
data = {
    "email": "netsparkertest#test.com",
    "password": "abcd12333",
}
r = requests.post('https://www.facebook.com', data=data)  # the URL needs a scheme
print r.text
print r.status_code
print r.content
print r.headers
I have a service to which I need to upload the content and the server starts sending the response after it gets certain amount of data, while my request body is still uploading.
headers = {'Content-Type': 'application/octet-stream', 'Expect': '100-continue', 'Connection' :'keep-alive'}
url = "https://MY_API_URL/WEBSERVICE"
response = requests.put(url, headers=headers,stream=True, data=data_gen(fh))
lines = response.iter_lines()
for line in lines:
    print line
data_gen is my generator function which takes a file handle of a very large file that yields 4KB per iteration.
My problem is that I don't get the "response" until the whole file has uploaded. Any ideas on how I can overcome this?
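For context, data_gen is presumably something along these lines (a sketch reconstructed from the description above):
def data_gen(fh, chunk_size=4096):
    # Yield the file in 4 KB chunks so requests streams the upload
    # instead of reading the whole file into memory at once.
    while True:
        chunk = fh.read(chunk_size)
        if not chunk:
            break
        yield chunk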
You cannot accomplish this with requests today. Requests (and the underlying libraries, including httplib/http.client [depending on your version of Python]) all send all of the data before they start reading the response.
One library that may be able to handle this (in fact, I'm fairly certain this should be doable with it) is treq. It uses Twisted, which should give you ways to determine when data is received, so all you should need to do is register a callback to start accessing that data.
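A rough, untested sketch of what that might look like with treq (the URL and file name are placeholders; whether the response really becomes readable before the upload has finished depends on the server and on Twisted's HTTP client):
import treq
from twisted.internet import task

def main(reactor):
    fh = open('large_file.bin', 'rb')   # placeholder file name
    d = treq.put(
        'https://MY_API_URL/WEBSERVICE',
        data=fh,                        # file-like body, streamed by treq
        headers={'Content-Type': 'application/octet-stream'},
    )

    def read_chunk(chunk):
        # Called with each piece of the response body as it arrives.
        print(chunk)

    d.addCallback(treq.collect, read_chunk)

    def cleanup(result):
        fh.close()
        return result

    d.addBoth(cleanup)
    return d

task.react(main)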
I have been having problems with a script I am developing whereby I am receiving no output, and the memory usage of the script is getting larger and larger over time. I have figured out that the problem lies with some of the URLs I am checking with the Requests library. I am expecting to download a webpage; however, I download a large file instead. All this data is then stored in memory, causing my issues.
What I want to know is; is there any way with the requests library to check what is being downloaded? With wget I can see: Length: 710330974 (677M) [application/zip].
Is this information available in the headers with requests? If so, is there a way of terminating the download upon figuring out it is not an HTML webpage?
Thanks in advance.
Yes, the headers can tell you a lot about the page; most pages will include a Content-Length header.
By default, however, the response is downloaded in its entirety before the .get(), .post(), etc. call returns. Set the stream=True keyword argument to defer downloading the response body:
response = requests.get(url, stream=True)
Now you can inspect the headers and just discard the request if you don't like what you find:
length = int(response.headers.get('Content-Length', 0))
if length > 1048576:
    print 'Response larger than 1MB, discarding'
Subsequently accessing the .content or .text attributes, or the .json() method will trigger a full download of the response.
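Putting it together, a sketch along these lines might work (the 1 MB cutoff and the text/html check are arbitrary choices for illustration):
import requests

response = requests.get(url, stream=True)
content_type = response.headers.get('Content-Type', '')
length = int(response.headers.get('Content-Length', 0))

if 'text/html' not in content_type or length > 1048576:
    response.close()          # drop the connection without downloading the body
else:
    html = response.text      # a normal-sized HTML page, safe to download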