Access HTML code in HTTP response with scapy - python

I have a program that uses scapy to sniff data. I'm trying to access the HTML returned in the HTTP response; I can access all the headers and the response body, which includes the HTML, BUT it appears as ��Z��}ks۸���[u��ܵ�#J��/ɴ+q2��&3��s�.
Accessing packet[Raw].load returns the above result.
Now, looking at the headers, I can see that the body is compressed with gzip, which explains why it's being displayed like this. So I tried decompressing it with gzip.GzipFile and with zlib, but in both cases I got an error message stating that this is not a gzip file.
Any help on decompressing it properly?
UPDATE: I noticed that the main issue is that I am trying to decompress part of the string: the HTTP response is being sent in chunks, and the decompress call fails because I am trying to decompress each chunk separately. If I combine all the chunks, I am able to decompress with zlib and gzip. But the same question remains: can I decompress the chunks one at a time, before combining them?

Well, you can do it with the gzip module:
import gzip
import StringIO

body_stream = StringIO.StringIO(body)        # wrap the raw body in a file-like object
gzipper = gzip.GzipFile(fileobj=body_stream)
data = gzipper.read()                        # decompress the whole body at once
print data[:25]
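To address the UPDATE: gzip.GzipFile needs the complete body, but zlib's decompressobj can consume a gzip stream incrementally, one chunk at a time. A minimal Python 3 sketch, where the compression and chunking merely simulate reassembled packets (wbits = 16 + zlib.MAX_WBITS tells zlib to expect a gzip header):

```python
import zlib

# Simulate a gzip-compressed HTTP body
payload = b"hello world " * 200
co = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
gz = co.compress(payload) + co.flush()

# Simulate the body arriving as separate chunks/packets
chunks = [gz[i:i + 50] for i in range(0, len(gz), 50)]

# Decompress chunk by chunk, without combining them first
do = zlib.decompressobj(16 + zlib.MAX_WBITS)
result = b"".join(do.decompress(c) for c in chunks)
```

Each decompress() call yields whatever output is available so far, so you can process the page content as the packets come in.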

Related

Sort header from HTTP response before writing it to a file

I am currently trying to implement an HTTP client using Python and sockets. It is very simple, and the only thing it has to do is download a file from a web server and put it into a file supplied by the user.
My code is working fine, but I am having trouble excluding the HTTP response header from the output file.
The HTTP response header is only at the beginning of the file, so I was thinking that I could just dump all the data into the file and then take the header out afterwards. This is a problem, though, since I/O is very slow.
My next thought was that I could run some regex on the first response I get from the server, strip away the header, and then dump the rest into the file. This seems like a very clunky way to do it, though.
Does anyone have any suggestions on how to do this in a smart way?
In an HTTP response, the headers are separated from the body with '\r\n\r\n'. To get only the body, you can try this:
bodyBegin = httpResponse.find('\r\n\r\n') + 4
body = httpResponse[bodyBegin:]
saveToFile(body)
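Since socket data arrives as bytes, and find() returns -1 when the separator has not arrived yet, a slightly more defensive version can help. split_http_response is a hypothetical helper, a sketch of the same idea:

```python
def split_http_response(raw):
    """Split a raw HTTP response (bytes) into (header_block, body).
    Returns an empty body if the '\\r\\n\\r\\n' separator has not
    been received yet."""
    sep = raw.find(b"\r\n\r\n")
    if sep == -1:
        return raw, b""  # headers incomplete; keep reading
    return raw[:sep], raw[sep + 4:]

resp = b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello"
head, body = split_http_response(resp)
```

Only the body part then gets written to the output file, so no post-hoc cleanup of the file is needed.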

How to get size of transmission using urllib2?

I have this basic code (from https://docs.python.org/2/howto/urllib2.html):
import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
I would like to get the size of the entire request and the size of the entire response. Is there any way?
(I haven't seen one for urllib2 or for requests.)
"Entire" means including headers and any metadata that might be sent with them.
Thanks.
res.headers may or may not contain a Content-Length field provided by the server. So int(res.headers['content-length']) will give you the size of the body, if the server provides it.
A very simple HTTP stream implementation might not provide this information at all, so you won't know the size until you hit EOF.
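That "if the server provides it" caveat is worth encoding explicitly. A small sketch with a hypothetical helper (it assumes a plain dict of lowercased header names; urllib2's real headers object is case-insensitive on its own):

```python
def content_length(headers):
    """Return the advertised body size as an int, or None when the
    server did not send a Content-Length header."""
    value = headers.get('content-length')
    return int(value) if value is not None else None

known = content_length({'content-length': '1234'})
unknown = content_length({})
```

When the helper returns None, the only way to learn the size is to read the response to EOF and take len() of what you got.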

Checking if a file is being downloaded by Python Requests library

I have been having problems with a script I am developing: I am receiving no output, and the memory usage of the script is getting larger and larger over time. I have figured out that the problem lies with some of the URLs I am checking with the Requests library. I am expecting to download a webpage, but instead I download a large file. All this data is then stored in memory, causing my issues.
What I want to know is: is there any way with the requests library to check what is being downloaded? With wget I can see: Length: 710330974 (677M) [application/zip].
Is this information available in the headers with requests? If so, is there a way of terminating the download upon figuring out that it is not an HTML webpage?
Thanks in advance.
Yes, the headers can tell you a lot about the page; most pages will include a Content-Length header.
By default, however, the response is downloaded in its entirety before the .get() or .post(), etc., call returns. Set the stream=True keyword to defer loading the response body:
response = requests.get(url, stream=True)
Now you can inspect the headers and just discard the response if you don't like what you find:
length = int(response.headers.get('Content-Length', 0))
if length > 1048576:
    print 'Response larger than 1MB, discarding'
    response.close()  # release the connection without fetching the body
Subsequently accessing the .content or .text attributes, or the .json() method will trigger a full download of the response.
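The Content-Type header answers the "is it HTML at all?" part of the question. A sketch with a hypothetical should_download helper; it assumes a plain dict with these exact key spellings, whereas requests' real response.headers is case-insensitive:

```python
def should_download(headers, max_bytes=1048576):
    """Decide from the response headers whether the body is worth
    fetching: it must claim to be HTML and be no larger than max_bytes."""
    ctype = headers.get('Content-Type', '')
    length = int(headers.get('Content-Length', 0))
    return ctype.startswith('text/html') and length <= max_bytes

page_ok = should_download({'Content-Type': 'text/html; charset=utf-8',
                           'Content-Length': '5000'})
zip_ok = should_download({'Content-Type': 'application/zip',
                          'Content-Length': '710330974'})
```

With stream=True, calling this check before touching .content means the 677M zip file from the question is never pulled into memory.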

How to send binary post data via HTTP?

I already have binary data read from a file. Most of the examples I see online link directly to a file and upload the whole file. I am looking for how to upload, via HTTP POST in Python, binary data that I already have from another source.
Alternatively:
import urllib2

# `data` holds the binary payload you already have in memory
req = urllib2.Request("http://example.com", data,
                      {'Content-Type': 'application/octet-stream'})
urllib2.urlopen(req)
That also shows how you can specify the Content-Type of the data.
I'm not sure what online examples you're looking at, but urllib2.urlopen takes the data to post as a chunk of data and not a file at all.
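For intuition, this is roughly what goes on the wire: the binary payload is appended verbatim after the header block. A Python 3 sketch with a hypothetical build_post helper (illustration only, not urllib2's actual internals):

```python
def build_post(host, path, payload):
    """Build the raw bytes of an HTTP POST request carrying a binary
    body: status line, headers, blank line, then the payload as-is."""
    head = ("POST %s HTTP/1.1\r\n"
            "Host: %s\r\n"
            "Content-Type: application/octet-stream\r\n"
            "Content-Length: %d\r\n"
            "\r\n" % (path, host, len(payload)))
    return head.encode("ascii") + payload

req = build_post("example.com", "/upload", b"\x00\x01\x02")
```

Note that the body needs no encoding of any kind; only Content-Length must match the payload's byte count.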

Http protocol, Content-Length, get page content Python

I'm trying to code my own Python 3 HTTP library to learn more about sockets and the HTTP protocol. My question is: if I do a recv(bytesToRead) on my socket, how can I get only the header, and then, using the Content-Length information, continue receiving the page content? Isn't that the purpose of the Content-Length header?
Thanks in advance
In the past, to accomplish this, I would read a portion of the socket data into a memory buffer and then read from that buffer until a "\r\n\r\n" sequence was encountered (you could use a state machine to do this, or simply use the string find() method). Once you reach that sequence, you know all of the headers have been read; you can then parse the headers and read exactly Content-Length bytes of body. Be prepared to handle a response that does not include a Content-Length header, since not all responses contain one.
If you run out of buffer before seeing that sequence, simply read more data from the socket into your buffer and continue processing.
I can post a C# example if you would like to look at it.
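The buffer-until-separator approach above can be sketched in Python 3 instead. read_response is a hypothetical helper; recv is any function behaving like socket.recv(n), and the fake below just replays a canned response in small slices:

```python
def read_response(recv):
    """Read from recv(n) until the header/body separator appears, parse
    Content-Length from the header block, then keep reading until the
    whole body has arrived."""
    buf = b""
    while b"\r\n\r\n" not in buf:          # accumulate until headers complete
        chunk = recv(4096)
        if not chunk:
            break                          # connection closed early
        buf += chunk
    head, _, body = buf.partition(b"\r\n\r\n")
    length = 0
    for line in head.split(b"\r\n")[1:]:   # skip the status line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value)
    while len(body) < length:              # body may span more recv() calls
        chunk = recv(4096)
        if not chunk:
            break
        body += chunk
    return head, body[:length]

data = b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello"
slices = iter([data[i:i + 7] for i in range(0, len(data), 7)])

def fake_recv(n):
    return next(slices, b"")

head, body = read_response(fake_recv)
```

If no Content-Length is found, length stays 0 here; a real client would instead read to EOF (or handle chunked transfer encoding) in that case.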
