I have a question about writing a small tool that prints the response headers for any website. I am new to Python and wanted to know whether there is anything other than encoding that I have to account for when developing the tool. I have a rough draft of my code below. Any pointers from Python coders?
#!/usr/bin/python
import sys, urllib
if len(sys.argv) == 2:
    website = sys.argv[1]
    website = urllib.urlopen(sys.argv[1])
    if(website.code != 200):
        print "Something went wrong here"
        print website.code
        exit(0)
    print 'Printing the headers'
    print '-----------------------------------------'
    for header, value in website.headers.items() :
        print header + ' : ' + value
This seems a fairly straightforward script (though the question seems more of a fit for Stack Overflow). A couple of comments. First, curl -I is a useful command-line tool to compare against. Second, even when you don't get a 200 status, there is often still useful content or headers you may want to display. E.g.,
$ curl -I http://security.stackexchange.com/asdf
HTTP/1.1 404 Not Found
Cache-Control: private
Content-Length: 24068
Content-Type: text/html; charset=utf-8
X-Frame-Options: SAMEORIGIN
Set-Cookie: prov=678b5b9c-0130-4398-9834-673475961dc6; domain=.stackexchange.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
Date: Fri, 25 Apr 2014 07:24:00 GMT
Also note urllib follows redirects automatically. E.g., with curl you'll see:
$ curl -I http://www.security.stackexchange.com
HTTP/1.1 301 Moved Permanently
Content-Length: 157
Content-Type: text/html; charset=UTF-8
Location: http://security.stackexchange.com/
Date: Fri, 25 Apr 2014 07:26:52 GMT
while your tool will just give:
$ python user3567119.py http://www.security.stackexchange.com
Printing the headers
-----------------------------------------
content-length : 68639
set-cookie : prov=9bf4f3d4-e3ae-4161-8e34-9aaa83f0aa4b; domain=.stackexchange.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
expires : Fri, 25 Apr 2014 07:29:32 GMT
vary : *
last-modified : Fri, 25 Apr 2014 07:28:32 GMT
connection : close
cache-control : public, no-cache="Set-Cookie", max-age=60
date : Fri, 25 Apr 2014 07:28:31 GMT
x-frame-options : SAMEORIGIN
content-type : text/html; charset=utf-8
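If you want your own tool to show the 301 the way curl -I does, one option is to use the standard library's http.client, which does not follow redirects. This is only a minimal Python 3 sketch (the script in the question is Python 2), and head_headers is a made-up helper name, not part of any library:

```python
import http.client
from urllib.parse import urlsplit


def head_headers(url, timeout=10):
    """Hypothetical helper: send one HEAD request, never follow redirects.

    Returns (status, reason, headers), where headers is a list of
    (name, value) pairs exactly as http.client reports them.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        # Rebuild the request target from path and query string.
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("HEAD", target)
        response = conn.getresponse()
        return response.status, response.reason, response.getheaders()
    finally:
        conn.close()
```

Calling head_headers('http://www.security.stackexchange.com') and printing the result would then show the redirect response itself rather than the page it points to.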
Third, if you continue playing around with HTTP requests in python, I highly recommend using requests. With requests, you'll be able to see the 301 if you do:
In [1]: import requests
In [2]: r=requests.get('http://www.security.stackexchange.com')
In [3]: r
Out[3]: <Response [200]>
In [4]: r.history
Out[4]: (<Response [301]>,)
It's also worth trying out some HTTP requests in just plain old telnet. E.g., telnet security.stackexchange.com 80 then quickly type:
GET / HTTP/1.1
Host: security.stackexchange.com
followed by a blank line. Then you'll see the actual HTTP response on the wire (instead of recreating it after urllib has processed the HTTP response):
HTTP/1.1 200 OK
Cache-Control: public, no-cache="Set-Cookie", max-age=60
Content-Type: text/html; charset=utf-8
Expires: Fri, 25 Apr 2014 07:38:37 GMT
Last-Modified: Fri, 25 Apr 2014 07:37:37 GMT
Vary: *
X-Frame-Options: SAMEORIGIN
Set-Cookie: prov=a75de1f2-678b-4a9d-bbfd-39e933e60237; domain=.stackexchange.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
Date: Fri, 25 Apr 2014 07:37:36 GMT
Content-Length: 68849
<!DOCTYPE html>
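Raw bytes captured this way can be split into status line, headers and body with a few lines of Python. A rough Python 3 sketch, assuming the whole header block has already been received; split_response is a hypothetical helper name:

```python
def split_response(raw):
    """Hypothetical helper: split raw HTTP response bytes.

    Returns (status_line, headers, body). Headers come back as a list of
    (name, value) pairs so repeated headers such as Set-Cookie keep their
    wire order.
    """
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]
    headers = []
    for line in lines[1:]:
        # Split on the first colon only; values such as dates contain colons.
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))
    return status_line, headers, body
```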
Related
I am writing a controller endpoint in Python and I want to modify the response code, content type, etc. How can I achieve this?
@http.route('/get/data/', methods=['GET', 'POST'], type="http", auth='public', csrf=False)
def fetchSms(self, **kwargs):
    mydata = {"date": "2018-10-13T00:46:17.25Z"}
    return simplejson.dumps(mydata)
I have to return mydata to the client after setting content-type="application/json", and the response code should be 200.
Current Output:
HTTP/1.1 200 OK
Server: nginx/1.12.2
Date: Mon, 15 Oct 2018 16:57:49 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 192
Connection: keep-alive
Set-Cookie: session_id=43edf5hfgh436sdgdga9d7618deb74af7; Expires=Sun, 13-Jan-2019 16:57:49 GMT; Max-Age=7776000; Path=/
I'm trying to send a HEAD request to this URL:
http://ubuntu-releases.mirror.net.in/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso
and get the size of the file. My current HEAD request looks like this:
head_request = "HEAD " + file_path + " HTTP/1.0%s" % ('\r\n\r\n')
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("ubuntu-releases.mirror.net.in", 80))  # host taken from the URL above
sock.send(head_request)
where file_path is "/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso". This works perfectly, but when I replace 1.0 with 1.1, I get a 400 Bad Request.
head_request = "HEAD " + file_path + " HTTP/1.1%s" % ('\r\n\r\n')
Why does this happen?
In HTTP/1.1 you must provide the Host: header.
Demonstration using netcat (nc) utility:
$ nc ubuntu-releases.mirror.net.in 80 <<END
HEAD /releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso HTTP/1.0
END
HTTP/1.1 200 OK
Date: Sat, 04 Mar 2017 07:25:22 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Wed, 15 Feb 2017 21:44:24 GMT
ETag: "5ca30000-5489895805e00"
Accept-Ranges: bytes
Content-Length: 1554186240
Connection: close
Content-Type: application/x-iso9660-image
$ nc ubuntu-releases.mirror.net.in 80 <<END
HEAD /releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso HTTP/1.1
END
HTTP/1.1 400 Bad Request
Date: Sat, 04 Mar 2017 07:25:33 GMT
Server: Apache/2.4.18 (Ubuntu)
Connection: close
Content-Type: text/html; charset=iso-8859-1
$ nc ubuntu-releases.mirror.net.in 80 <<END
HEAD /releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso HTTP/1.1
Host: ubuntu-releases.mirror.net.in
END
HTTP/1.1 200 OK
Date: Sat, 04 Mar 2017 07:27:27 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Wed, 15 Feb 2017 21:44:24 GMT
ETag: "5ca30000-5489895805e00"
Accept-Ranges: bytes
Content-Length: 1554186240
Content-Type: application/x-iso9660-image
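The same fix in Python is a matter of adding the Host line before the blank line that ends the request. A minimal Python 3 sketch; build_head_request and head are illustrative names, and the Connection: close header is an addition so that the read loop terminates:

```python
import socket


def build_head_request(host, path):
    """Hypothetical helper: bytes of an HTTP/1.1 HEAD request with a Host header."""
    lines = [
        "HEAD %s HTTP/1.1" % path,
        "Host: %s" % host,      # mandatory in HTTP/1.1
        "Connection: close",    # added here so the server closes after replying
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def head(host, path, port=80, timeout=10):
    """Send the request over a plain TCP socket and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_head_request(host, path))
        received = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            received.append(data)
    return b"".join(received)
```

The file size can then be read from the Content-Length line of the returned bytes, provided the server sends one.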
I'm trying to bypass the Varnish cache from the client side. With nginx 1.6.1 it works by adding a random URL parameter (see X-XHR-Current-Location), so the response doesn't get "X-Cache":"HIT".
Server: nginx/1.6.1
Date: Fri, 04 Sep 2015 13:13:02 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 20762
Status: 200 OK
X-XHR-Current-Location: /shop.json?1441372381.857126?1441372381.854355
X-UA-Compatible: IE=Edge,chrome=1
ETag: "de6da75aa7b7d6ce34bd736ccf991f36"
X-Request-Id: a39fc3b8d44687039a18499dd22a2c7d
X-Runtime: 0.371739
X-Rack-Cache: miss
Accept-Ranges: bytes
X-Varnish: 534989417
Age: 0
Via: 1.1 varnish
Cache-Control: private, max-age=0, must-revalidate
Pragma: no-cache
X-Cache: MISS
X-Cache: MISS from localhost
X-Cache-Lookup: MISS from localhost:3128
Connection: close
But as soon as the request hits an nginx/1.8.0 server, the URL somehow gets stripped (see X-XHR-Current-Location) and the random parameter is removed. The "X-Cache" header also comes back as a "HIT".
Server: nginx/1.8.0
Date: Fri, 04 Sep 2015 13:13:14 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 3555
Status: 200 OK
X-XHR-Current-Location: /shop/301316.json
X-UA-Compatible: IE=Edge,chrome=1
ETag: "2e88dffe16a385872368e19e0370a999"
X-Request-Id: 3404c637c6a499d8e32a6e5c243e4d69
X-Runtime: 0.065267
X-Rack-Cache: miss
Accept-Ranges: bytes
X-Varnish: 561085217 561069463
Age: 823
Via: 1.1 varnish
Cache-Control: private, max-age=0, must-revalidate
Pragma: no-cache
X-Cache: HIT
X-Cache: MISS from localhost
X-Cache-Lookup: MISS from localhost:3128
Connection: close
I guess that's also the reason I sometimes get old results. Is there any way I can avoid the "HIT", or otherwise make the request look like a new URL to the nginx/1.8.0 servers?
Thanks in advance!
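For reference, the random-parameter approach described in the question can be sketched in Python 3 as below. The first header dump shows two `?` characters in X-XHR-Current-Location, which may mean the parameter is appended without checking for an existing query string; this sketch avoids that. The helper name add_cache_buster and the parameter name `_` are assumptions, and whether this avoids the HIT depends entirely on how the cache builds its key:

```python
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_cache_buster(url, name="_", value=None):
    """Hypothetical helper: append one extra query parameter to url.

    Any existing query string is kept, so the result contains a single '?'.
    """
    if value is None:
        value = repr(time.time())  # timestamp, similar to the values in the question
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((name, str(value)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```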
I used the following python code to download the html page:
response = urllib2.urlopen(current_URL)
msg = response.read()
print msg
For a page such as this one, it opens the URL without error but then prints only part of the HTML page!
The HTTP headers of the page are listed below. I think the problem is due to "Transfer-Encoding: chunked".
It seems urllib2 returns only the first chunk, and I am having trouble reading the rest. How can I read the remaining chunks?
Server: nginx/1.0.5
Date: Wed, 27 Feb 2013 14:41:28 GMT
Content-Type: text/html;charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Set-Cookie: route=c65b16937621878dd49065d7d58047b2; Path=/
Set-Cookie: JSESSIONID=EE18E813EE464664EA64086D5AE9A290.tpdjo13v_3; Path=/
Pragma: No-cache
Cache-Control: no-cache,no-store,max-age=0
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Vary: Accept-Encoding
Content-Language: fr
I've found out that if the Accept-Language header is specified, the server doesn't drop the TCP connection; otherwise it does.
curl -H "Accept-Language:uk,en-US;q=0.8,en;q=0.6,ru;q=0.4" -v 'http://www.legifrance.gouv.fr/affichJuriJudi.do?oldAction=rechJuriJudi&idTexte=JURITEXT000024053954&fastReqId=660326373&fastPos=1'
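The equivalent of that curl call in Python might look like the sketch below. It uses Python 3's urllib.request (the successor to urllib2) and the helper names build_request and fetch are made up for illustration:

```python
import urllib.request


def build_request(url, accept_language="uk,en-US;q=0.8,en;q=0.6,ru;q=0.4"):
    """Hypothetical helper: a GET request that carries an Accept-Language header."""
    return urllib.request.Request(url, headers={"Accept-Language": accept_language})


def fetch(url, timeout=30):
    """Download the whole body; chunked transfer encoding is decoded by the library."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```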
When attempting to check the 'content-length' header for some web pages using urllib2 in python, the header is missing. For example, the response from google.com is missing this header. Any idea why?
Example:
r = urllib2.urlopen('http://www.google.com')
i = r.info()
print i.keys()
Gives:
['x-xss-protection', 'set-cookie', 'expires', 'server', 'connection', 'cache-control', 'date', 'p3p', 'content-type', 'x-frame-options']
You can see here that an HTTP response can contain either Content-Length or Transfer-Encoding: chunked.
When Transfer-Encoding: chunked is used, the headers are followed by a hexadecimal string that, converted to decimal, gives the length of the next chunk. After the last chunk you get a 0 for this value, which means you have reached the end of the body.
You can use regular expressions to get this hexadecimal value (not a must though)
import re
read = '1f40\r\n'  # placeholder: a line or part of the HTTP response
hexPat = re.compile(r'([0-9A-F]+)\r\n', re.I)
match = re.search(hexPat, read)
chunkLen = int(match.group(1), 16)  # converts hexadecimal to decimal
Or you can just read the first hexadecimal value, get the length of the first chunk and receive that chunk, then get the length of the next chunk, and so on until you find a 0.
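The second approach (read a size line, take that many bytes, repeat until 0) can be sketched like this in Python 3. It assumes the complete chunked body is already in memory as bytes, and decode_chunked is an illustrative name:

```python
def decode_chunked(data):
    """Hypothetical helper: decode a complete chunked body into its payload bytes.

    data is everything that follows the blank line after the headers.
    """
    payload = bytearray()
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        # The size line may carry extensions after ';', which are ignored here.
        size = int(data[pos:line_end].split(b";")[0].strip(), 16)
        pos = line_end + 2
        if size == 0:
            break  # last chunk; any trailer headers after it are ignored
        payload += data[pos:pos + size]
        pos += size + 2  # skip the chunk data and the CRLF that follows it
    return bytes(payload)
```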
A HEAD response SHOULD include the same Content-Length value as the corresponding GET response, but it does not always do so:
Stack Overflow does:
> telnet stackoverflow.com 80
HEAD / HTTP/1.1
Host: stackoverflow.com
HTTP/1.1 200 OK
Cache-Control: public, max-age=60
Content-Length: 362245 <--------
Content-Type: text/html; charset=utf-8
Expires: Mon, 04 Oct 2010 11:51:49 GMT
Last-Modified: Mon, 04 Oct 2010 11:50:49 GMT
Vary: *
Date: Mon, 04 Oct 2010 11:50:49 GMT
Google doesn't:
> telnet www.google.com 80
HEAD / HTTP/1.1
Host: www.google.ie
HTTP/1.1 200 OK
Date: Mon, 04 Oct 2010 11:55:36 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Server: gws
X-XSS-Protection: 1; mode=block
Transfer-Encoding: chunked
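So code that inspects a HEAD response should not assume Content-Length is present. A small Python 3 sketch of that check; body_framing is a made-up name, and it takes headers as (name, value) pairs:

```python
def body_framing(headers):
    """Hypothetical helper: report how a response advertises its body length."""
    fields = {name.lower(): value for name, value in headers}
    if "chunked" in fields.get("transfer-encoding", "").lower():
        return "chunked"
    if "content-length" in fields:
        return "content-length: " + fields["content-length"].strip()
    return "unspecified"
```

With the two responses above, the Stack Overflow headers would give "content-length: 362245" and the Google headers would give "chunked".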