Trying to avoid varnish cache - python

I'm trying to bypass the Varnish cache from the client side. With nginx 1.6.1 it works by adding a random URL parameter (see X-XHR-Current-Location below), so the response doesn't come back with "X-Cache: HIT".
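Roughly, the request side looks like this (a sketch in Python; url and the time-based parameter are placeholders, not my exact code):

import time
import urllib2

# Append a unique timestamp so each request uses a URL Varnish has never cached
busted_url = url + '?' + str(time.time())
response = urllib2.urlopen(busted_url)

Against nginx 1.6.1 the response headers come back as a MISS: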
Server: nginx/1.6.1
Date: Fri, 04 Sep 2015 13:13:02 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 20762
Status: 200 OK
X-XHR-Current-Location: /shop.json?1441372381.857126?1441372381.854355
X-UA-Compatible: IE=Edge,chrome=1
ETag: "de6da75aa7b7d6ce34bd736ccf991f36"
X-Request-Id: a39fc3b8d44687039a18499dd22a2c7d
X-Runtime: 0.371739
X-Rack-Cache: miss
Accept-Ranges: bytes
X-Varnish: 534989417
Age: 0
Via: 1.1 varnish
Cache-Control: private, max-age=0, must-revalidate
Pragma: no-cache
X-Cache: MISS
X-Cache: MISS from localhost
X-Cache-Lookup: MISS from localhost:3128
Connection: close
But as soon as the request hits an nginx/1.8.0 server, the URL somehow gets stripped (see X-XHR-Current-Location) and the random parameter is removed. "X-Cache" then comes back as a "HIT":
Server: nginx/1.8.0
Date: Fri, 04 Sep 2015 13:13:14 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 3555
Status: 200 OK
X-XHR-Current-Location: /shop/301316.json
X-UA-Compatible: IE=Edge,chrome=1
ETag: "2e88dffe16a385872368e19e0370a999"
X-Request-Id: 3404c637c6a499d8e32a6e5c243e4d69
X-Runtime: 0.065267
X-Rack-Cache: miss
Accept-Ranges: bytes
X-Varnish: 561085217 561069463
Age: 823
Via: 1.1 varnish
Cache-Control: private, max-age=0, must-revalidate
Pragma: no-cache
X-Cache: HIT
X-Cache: MISS from localhost
X-Cache-Lookup: MISS from localhost:3128
Connection: close
I guess that's also the reason I sometimes get old results. Is there any way I can avoid the "HIT", or pretend to be a new URL, for the nginx/1.8.0 servers?
Thanks in advance!

Related

PY4E Question -- Exploring the HyperText Transport Protocol

I'm sorry for the basic question, but when I use the following code to retrieve a document:
import socket

# Open a TCP connection to the web server
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))

# Send an HTTP/1.0 GET request for the document
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

# Read 512 bytes at a time until the server closes the connection
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')

mysock.close()
What my terminal ends up returning is the following:
HTTP/1.1 302 Found
Location: http://google.com/
Connection: close
Content-Length: 0
Cache-Control: no-cache, no-store
Why won't it retrieve the file itself?
Your code works for me:
HTTP/1.1 200 OK
Date: Thu, 28 Apr 2022 20:20:16 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
It appears your request is getting a 302 redirect asking your client to fetch http://google.com/ instead of the file. Since the same code works for me, the redirect most likely comes from something on your network rather than from data.pr4e.org.
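If something on your network is intercepting the request, no change to the socket code will fix it, but as a sanity check you can let the standard library follow redirects for you (a minimal sketch using urllib.request; if this also ends up at google.com, the redirect is coming from your network):

import urllib.request

# urllib follows 302 redirects automatically
with urllib.request.urlopen('http://data.pr4e.org/romeo.txt') as response:
    print(response.geturl())          # final URL after any redirects
    print(response.read().decode())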

urllib2 returns sometimes old page - returns strange header

I'm working on a Python script that works with JSON returned by a URL.
For a couple of days now, urllib2 has sometimes been returning an old state of the JSON.
I added headers such as "Cache-Control: max-age=0", but it still happens sometimes.
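For reference, the header-setting code looks roughly like this (a sketch; the real script and URL differ):

import urllib2

# url is the JSON endpoint (placeholder)
request = urllib2.Request(url)
request.add_header('Cache-Control', 'max-age=0')
request.add_header('Pragma', 'no-cache')
response = urllib2.urlopen(request)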
If I print out the response info, I get:
Server: nginx/1.8.0
Date: Thu, 03 Sep 2015 17:02:47 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 3539
Status: 200 OK
X-XHR-Current-Location: /shop/169464.json
X-UA-Compatible: IE=Edge,chrome=1
ETag: "b1fbe7a01e0832025a3afce23fc2ab56"
X-Request-Id: 4cc0d399f943ad09a903f18a6ce1c488
X-Runtime: 0.123033
X-Rack-Cache: miss
Accept-Ranges: bytes
X-Varnish: 1707606900 1707225496
Age: 2860
Via: 1.1 varnish
Cache-Control: private, max-age=0, must-revalidate
Pragma: no-cache
X-Cache: HIT
X-Cache: MISS from adsl
X-Cache-Lookup: MISS from adsl:21261
Connection: close
Does it have something to do with the "Age" or "X-Rack-Cache" header? Or any ideas how I can fix it?
Thanks in advance!
Try faking the user agent, removing cookies, and dropping sessions:
import random
import urllib2

# Substitute full User-Agent strings in practice; random.choice replaces the undefined get_random
fake_user_agent = ['chrome', 'firefox', 'safari']
request = urllib2.Request(url)
request.add_header('User-Agent', random.choice(fake_user_agent))
content = urllib2.build_opener().open(request)
If that doesn't work, try using Tor to change your IP per request.
If nothing works, you can't bypass it, because you are almost certainly connecting through a transparent proxy.

Python Scripting Question

I want to write a small tool that prints the headers for any website. I am new to Python, but is there anything other than encoding that I would have to account for in my code when developing the tool? A rough draft of my code is shown below. Any pointers from the Python coders?
#!/usr/bin/python
import sys, urllib

if len(sys.argv) == 2:
    website = urllib.urlopen(sys.argv[1])
    if website.code != 200:
        print "Something went wrong here"
        print website.code
        exit(0)
    print 'Printing the headers'
    print '-----------------------------------------'
    for header, value in website.headers.items():
        print header + ' : ' + value
Seems a fairly straightforward script (though this question is more of a fit for Stack Overflow). A couple of comments: first, curl -I is a useful command-line tool to compare against. Second, even when you don't get a 200 status, there are often useful content or headers you may want to display. E.g.,
$ curl -I http://security.stackexchange.com/asdf
HTTP/1.1 404 Not Found
Cache-Control: private
Content-Length: 24068
Content-Type: text/html; charset=utf-8
X-Frame-Options: SAMEORIGIN
Set-Cookie: prov=678b5b9c-0130-4398-9834-673475961dc6; domain=.stackexchange.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
Date: Fri, 25 Apr 2014 07:24:00 GMT
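A sketch of how the draft could print the status and headers for non-200 responses as well (it reuses the draft's own urllib calls; urllib.urlopen does not raise on a 404, so no extra handling is needed):

#!/usr/bin/python
import sys, urllib

if len(sys.argv) == 2:
    website = urllib.urlopen(sys.argv[1])
    # Report the status code but keep going: 404s carry headers too
    print 'Status:', website.code
    print '-----------------------------------------'
    for header, value in website.headers.items():
        print header + ' : ' + value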
Also note urllib follows redirects automatically. E.g., with curl you'll see:
$ curl -I http://www.security.stackexchange.com
HTTP/1.1 301 Moved Permanently
Content-Length: 157
Content-Type: text/html; charset=UTF-8
Location: http://security.stackexchange.com/
Date: Fri, 25 Apr 2014 07:26:52 GMT
while your tool will just give:
$ python user3567119.py http://www.security.stackexchange.com
Printing the headers
-----------------------------------------
content-length : 68639
set-cookie : prov=9bf4f3d4-e3ae-4161-8e34-9aaa83f0aa4b; domain=.stackexchange.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
expires : Fri, 25 Apr 2014 07:29:32 GMT
vary : *
last-modified : Fri, 25 Apr 2014 07:28:32 GMT
connection : close
cache-control : public, no-cache="Set-Cookie", max-age=60
date : Fri, 25 Apr 2014 07:28:31 GMT
x-frame-options : SAMEORIGIN
content-type : text/html; charset=utf-8
Third, if you continue playing around with HTTP requests in Python, I highly recommend the requests library. With requests, you'll be able to see the 301 if you do:
In [1]: import requests
In [2]: r=requests.get('http://www.security.stackexchange.com')
In [3]: r
Out[3]: <Response [200]>
In [4]: r.history
Out[4]: (<Response [301]>,)
It's also worth trying out some HTTP requests in just plain old telnet. E.g., telnet security.stackexchange.com 80 then quickly type:
GET / HTTP/1.1
Host: security.stackexchange.com
followed by a blank line. Then you'll see the actual HTTP response on the wire (instead of recreating it after urllib has processed the HTTP response):
HTTP/1.1 200 OK
Cache-Control: public, no-cache="Set-Cookie", max-age=60
Content-Type: text/html; charset=utf-8
Expires: Fri, 25 Apr 2014 07:38:37 GMT
Last-Modified: Fri, 25 Apr 2014 07:37:37 GMT
Vary: *
X-Frame-Options: SAMEORIGIN
Set-Cookie: prov=a75de1f2-678b-4a9d-bbfd-39e933e60237; domain=.stackexchange.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly
Date: Fri, 25 Apr 2014 07:37:36 GMT
Content-Length: 68849
<!DOCTYPE html>

Is there a way to tell if a page opened with Mechanize isn't returning "search results"?

I am using Mechanize to log in to a web site and make a search. After extracting the links/info I want, I recursively move from the current page to the next. What I'm wondering is whether there's an easy way to tell, based on header information for instance, that I've hit a "No results found" or similar page. If so, I could quickly check the headers for a "404" or a no-results page and then return.
I couldn't find it in the documentation, and from what I can tell the answer is no. Can anyone here say more definitively, though, whether the answer is in fact no? Thanks in advance.
(Presently I just do a .find() for 'no results' after I .read() the link.)
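For reference, the current check looks roughly like this (a sketch; br is a mechanize Browser, and 'no results' stands in for whatever text the site actually shows):

def page_has_results(br, link):
    # Content-based check: read the page and look for the marker text
    html = br.open(link).read()
    return 'no results' not in html.lower()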
NOTES:
1) Header Info for a "good" page (with results):
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: nginx
header: Date: Thu, 12 Sep 2013 18:33:10 GMT
header: Content-Type: text/html; charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: close
header: Vary: Accept-Encoding
header: Status: 200 OK
header: X-UA-Compatible: IE=Edge,chrome=1
header: Cache-Control: must-revalidate, private, max-age=0
header: X-Request-Id: b501064808b265fc6e478fa88e622710
header: X-Runtime: 0.478829
header: X-Rack-Cache: miss
header: Content-Encoding: gzip
2) Header info from a "bad" page (no results):
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: nginx
header: Date: Thu, 12 Sep 2013 18:33:11 GMT
header: Content-Type: text/html; charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: close
header: Vary: Accept-Encoding
header: Status: 200 OK
header: X-UA-Compatible: IE=Edge,chrome=1
header: Cache-Control: must-revalidate, private, max-age=0
header: X-Request-Id: 1ae89b2b25ba7983f8a48fa17f7a1798
header: X-Runtime: 0.127865
header: X-Rack-Cache: miss
header: Content-Encoding: gzip
The response headers are generated by the server; if you control it, you could add your own "no results" header and parse that. Otherwise you have to analyze the content.
If you're set on using the headers, the only difference I can see between the two is that the bad search returned about 4x faster (X-Runtime: 0.127865 vs. 0.478829); maybe you could keep a moving average of elapsed response times.
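A sketch of that timing heuristic (assuming urllib2/mechanize-style responses whose info() exposes X-Runtime; the 0.5 threshold is a guess to tune):

runtimes = []

def looks_like_no_results(response, threshold=0.5):
    # Flag responses that come back much faster than the running average
    runtime = float(response.info().getheader('X-Runtime', '0'))
    runtimes.append(runtime)
    average = sum(runtimes) / len(runtimes)
    return runtime < average * threshold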

urllib2 python (Transfer-Encoding: chunked)

I used the following Python code to download the HTML page:
import urllib2

response = urllib2.urlopen(current_URL)
msg = response.read()
print msg
For a page such as this one, it opens the URL without error but then prints only part of the HTML page!
In the following lines you can find the HTTP headers of the page. I think the problem is due to "Transfer-Encoding: chunked".
It seems urllib2 returns only the first chunk! I have difficulties reading the remaining chunks. How can I read them?
Server: nginx/1.0.5
Date: Wed, 27 Feb 2013 14:41:28 GMT
Content-Type: text/html;charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Set-Cookie: route=c65b16937621878dd49065d7d58047b2; Path=/
Set-Cookie: JSESSIONID=EE18E813EE464664EA64086D5AE9A290.tpdjo13v_3; Path=/
Pragma: No-cache
Cache-Control: no-cache,no-store,max-age=0
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Vary: Accept-Encoding
Content-Language: fr
I've found out that if an Accept-Language header is specified, the server doesn't drop the TCP connection; otherwise it does:
curl -H "Accept-Language:uk,en-US;q=0.8,en;q=0.6,ru;q=0.4" -v 'http://www.legifrance.gouv.fr/affichJuriJudi.do?oldAction=rechJuriJudi&idTexte=JURITEXT000024053954&fastReqId=660326373&fastPos=1'
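The urllib2 equivalent of that curl command (a sketch; current_URL is the same variable as in the question):

import urllib2

# Send Accept-Language so the server keeps the connection open
request = urllib2.Request(current_URL)
request.add_header('Accept-Language', 'uk,en-US;q=0.8,en;q=0.6,ru;q=0.4')
msg = urllib2.urlopen(request).read()  # the full page, not just the first chunk
print msg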
