I'm sorry for the basic question, but when I use the following code to retrieve a document:
import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)
while True:
data = mysock.recv(512)
if len(data) < 1:
break
print(data.decode(),end='')
mysock.close()
What my terminal ends up returning is the following:
HTTP/1.1 302 Found
Location: http://google.com/
Connection: close
Content-Length: 0
Cache-Control: no-cache, no-store
Why won't it retrieve the file itself?
Your code works for me :
HTTP/1.1 200 OK
Date: Thu, 28 Apr 2022 20:20:16 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
It appears your code is accessing a different url, (presumably, http://google.com), and is getting a 302 redirect, asking your browser to try to get http://google.com/ instead.
Related
I'm trying to send a HEAD request to this url :
http://ubuntu-releases.mirror.net.in/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso
and get the size of the file. My current head request looks like this :
head_request = "HEAD " + file_path + " HTTP/1.0%s" % ('\r\n\r\n')
socket.socket(socket.AF_INET, socket.SOCK_STREAM).send(head_request)
where file_path is "/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso". This works perfectly but when I replace 1.0 by 1.1, I get a 400 HTTP Bad Request.
head_request = "HEAD " + file_path + " HTTP/1.1%s" % ('\r\n\r\n')
Why does this happen ?
In HTTP/1.1 you must provide the Host: header.
Demonstration using netcat (nc) utility:
$ nc ubuntu-releases.mirror.net.in 80 <<END
HEAD /releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso HTTP/1.0
END
HTTP/1.1 200 OK
Date: Sat, 04 Mar 2017 07:25:22 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Wed, 15 Feb 2017 21:44:24 GMT
ETag: "5ca30000-5489895805e00"
Accept-Ranges: bytes
Content-Length: 1554186240
Connection: close
Content-Type: application/x-iso9660-image
$ nc ubuntu-releases.mirror.net.in 80 <<END
HEAD /releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso HTTP/1.1
END
HTTP/1.1 400 Bad Request
Date: Sat, 04 Mar 2017 07:25:33 GMT
Server: Apache/2.4.18 (Ubuntu)
Connection: close
Content-Type: text/html; charset=iso-8859-1
$ nc ubuntu-releases.mirror.net.in 80 <<END
HEAD /releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso HTTP/1.1
Host: ubuntu-releases.mirror.net.in
END
HTTP/1.1 200 OK
Date: Sat, 04 Mar 2017 07:27:27 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Wed, 15 Feb 2017 21:44:24 GMT
ETag: "5ca30000-5489895805e00"
Accept-Ranges: bytes
Content-Length: 1554186240
Content-Type: application/x-iso9660-image
I'm trying to avoid the varnish cache from client side. With nginx 1.6.1 it works with adding a random url parameter (see X-XHR-Current-Location) so it doesnt get the "X-Cache":"HIT".
Server: nginx/1.6.1
Date: Fri, 04 Sep 2015 13:13:02 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 20762
Status: 200 OK
X-XHR-Current-Location: /shop.json?1441372381.857126?1441372381.854355
X-UA-Compatible: IE=Edge,chrome=1
ETag: "de6da75aa7b7d6ce34bd736ccf991f36"
X-Request-Id: a39fc3b8d44687039a18499dd22a2c7d
X-Runtime: 0.371739
X-Rack-Cache: miss
Accept-Ranges: bytes
X-Varnish: 534989417
Age: 0
Via: 1.1 varnish
Cache-Control: private, max-age=0, must-revalidate
Pragma: no-cache
X-Cache: MISS
X-Cache: MISS from localhost
X-Cache-Lookup: MISS from localhost:3128
Connection: close
but as soons as I hit with the request a nginx/1.8.0 Server the URL gets somehow striped (see X-XHR-Current-Location) and the random parameter gets removed. Also the "X-Cache" gets triggered and returns a "HIT".
Server: nginx/1.8.0
Date: Fri, 04 Sep 2015 13:13:14 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 3555
Status: 200 OK
X-XHR-Current-Location: /shop/301316.json
X-UA-Compatible: IE=Edge,chrome=1
ETag: "2e88dffe16a385872368e19e0370a999"
X-Request-Id: 3404c637c6a499d8e32a6e5c243e4d69
X-Runtime: 0.065267
X-Rack-Cache: miss
Accept-Ranges: bytes
X-Varnish: 561085217 561069463
Age: 823
Via: 1.1 varnish
Cache-Control: private, max-age=0, must-revalidate
Pragma: no-cache
X-Cache: HIT
X-Cache: MISS from localhost
X-Cache-Lookup: MISS from localhost:3128
Connection: close
I guess thats also the reason I get old results sometimes. Is there any way I can avoid the "HIT" or also pretend to be a new URL for the nginx/1.8.0 servers?
thanks in advance!
I'm working on a python script which works with a JSON returned by an URL.
Since a couple of days urllib2 returns (just sometimes) an old state of the JSON.
I did add the headers "Cache-Control":"max-age=0" etc. still it sometimes happen.
If I print out the request info I get:
Server: nginx/1.8.0
Date: Thu, 03 Sep 2015 17:02:47 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 3539
Status: 200 OK
X-XHR-Current-Location: /shop/169464.json
X-UA-Compatible: IE=Edge,chrome=1
ETag: "b1fbe7a01e0832025a3afce23fc2ab56"
X-Request-Id: 4cc0d399f943ad09a903f18a6ce1c488
X-Runtime: 0.123033
X-Rack-Cache: miss
Accept-Ranges: bytes
X-Varnish: 1707606900 1707225496
Age: 2860
Via: 1.1 varnish
Cache-Control: private, max-age=0, must-revalidate
Pragma: no-cache
X-Cache: HIT
X-Cache: MISS from adsl
X-Cache-Lookup: MISS from adsl:21261
Connection: close
has it something to do with the header "Age" or "X-Cache-Rack"? Or any ideas how I can fix it?
thanks in advance!
try to fake the user-agent, remove cookies, drop sessions.
fake_user_agent = ['chrome','firefox','safari']
request = urllib2.Request(url)
request.add_header('User-Agent', get_random(fake_user_agent))
content = urllib2.build_opener().open(request)
if all doesn't work, then try using tor to change ip per request.
if nothing works, then you can't bypass it because you are most definitely connecting to the transparent proxy
I have a question about writing a small tool that would provide the headers for any website. I am new to python but wanted to know if there is anything else other than encoding that I would have to account for in my code when developing the tool? I have a rough draft of my code shown below. Any pointers from the python coders?
#!/usr/bin/python
import sys, urllib
if len(sys.argv) == 2:
website = sys.argv[1]
website = urllib.urlopen(sys.argv[1])
if(website.code != 200):
print "Something went wrong here"
print website.code
exit(0)
print 'Printing the headers'
print '-----------------------------------------'
for header, value in website.headers.items() :
print header + ' : ' + value
Seems a fairly straightforward script (though this question seems more of a fit for stackoverflow). Couple comments, first curl -I is a useful command line tool to compare against. Second, even when you don't get 200 status, there are still often useful content or headers you may want to display. E.g.,
$ curl -I http://security.stackexchange.com/asdf
HTTP/1.1 404 Not Found
Cache-Control: private
Content-Length: 24068
Content-Type: text/html; charset=utf-8
X-Frame-Options: SAMEORIGIN
Set-Cookie: prov=678b5b9c-0130-4398-9834-673475961dc6; domain=.stackexchange.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
Date: Fri, 25 Apr 2014 07:24:00 GMT
Also note urllib follows redirects automatically. E.g., with curl you'll see:
$ curl -I http://www.security.stackexchange.com
HTTP/1.1 301 Moved Permanently
Content-Length: 157
Content-Type: text/html; charset=UTF-8
Location: http://security.stackexchange.com/
Date: Fri, 25 Apr 2014 07:26:52 GMT
while your tool will just give.
$ python user3567119.py http://www.security.stackexchange.com
Printing the headers
-----------------------------------------
content-length : 68639
set-cookie : prov=9bf4f3d4-e3ae-4161-8e34-9aaa83f0aa4b; domain=.stackexchange.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
expires : Fri, 25 Apr 2014 07:29:32 GMT
vary : *
last-modified : Fri, 25 Apr 2014 07:28:32 GMT
connection : close
cache-control : public, no-cache="Set-Cookie", max-age=60
date : Fri, 25 Apr 2014 07:28:31 GMT
x-frame-options : SAMEORIGIN
content-type : text/html; charset=utf-8
Third, if you continue playing around with HTTP requests in python, I highly recommend using requests. With requests, you'll be able to see the 301 if you do:
In [1]: import requests
In [2]: r=requests.get('http://www.security.stackexchange.com')
In [3]: r
Out[3]: <Response [200]>
In [4]: r.history
Out[4]: (<Response [301]>,)
It's also worth trying out some HTTP requests in just plain old telnet. E.g., telnet security.stackexchange.com 80 then quickly type:
GET / HTTP/1.1
Host: security.stackexchange.com
followed by a blank line. Then you'll see the actual HTTP response on the wire (instead of recreating it after urllib has processed the HTTP response):
HTTP/1.1 200 OK
Cache-Control: public, no-cache="Set-Cookie", max-age=60
Content-Type: text/html; charset=utf-8
Expires: Fri, 25 Apr 2014 07:38:37 GMT
Last-Modified: Fri, 25 Apr 2014 07:37:37 GMT
Vary: *
X-Frame-Options: SAMEORIGIN
Set-Cookie: prov=a75de1f2-678b-4a9d-bbfd-39e933e60237; domain=.stackexchange.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
Date: Fri, 25 Apr 2014 07:37:36 GMT
Content-Length: 68849
<!DOCTYPE html>
I used the following python code to download the html page:
response = urllib2.urlopen(current_URL)
msg = response.read()
print msg
For a page such as this one, it opens the url without error but then prints only part of the html-page!
In the following lines you can find the http headers of the html-page. I think the problem is due to "Transfer-Encoding: chunked".
It seems urllib2 returns only the first chunk! I have difficulties reading the remaining chunks. How I can read the remaining chunks?
Server: nginx/1.0.5
Date: Wed, 27 Feb 2013 14:41:28 GMT
Content-Type: text/html;charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Set-Cookie: route=c65b16937621878dd49065d7d58047b2; Path=/
Set-Cookie: JSESSIONID=EE18E813EE464664EA64086D5AE9A290.tpdjo13v_3; Path=/
Pragma: No-cache
Cache-Control: no-cache,no-store,max-age=0
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Vary: Accept-Encoding
Content-Language: fr
I've found out that if I Accept-Language header is specified than server doesn't drop TCP connection, otherwise it does.
curl -H "Accept-Language:uk,en-US;q=0.8,en;q=0.6,ru;q=0.4" -v 'http://www.legifrance.gouv.fr/affichJuriJudi.do?oldAction=rechJuriJudi&idTexte=JURITEXT000024053954&fastReqId=660326373&fastPos=1'