Disable browser caching in pylons - python

I have an action /json that returns JSON from the server.
Unfortunately, IE likes to cache this JSON.
How can I make it so that this action isn't cached?

Make sure your response headers have:
Cache-Control: no-cache
Pragma: no-cache
Expires: -1
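In a Pylons controller you can set those headers on the response object before returning the JSON. A minimal sketch, assuming a standard Pylons 1.x controller (the controller name and payload are placeholders, not from the original question):

import json
from pylons import response
from pylons.controllers import WSGIController

class JsonController(WSGIController):   # real apps usually subclass their project's BaseController
    def json(self):
        # Tell IE (and any intermediate proxies) not to cache this response.
        response.headers['Cache-Control'] = 'no-cache'
        response.headers['Pragma'] = 'no-cache'
        response.headers['Expires'] = '-1'
        response.content_type = 'application/json'
        return json.dumps({'status': 'ok'})   # placeholder payload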

Make sure your responses are not telling the browser that the content expires in the future. There are two HTTP headers that control this:
Expires
Cache-Control - There are many possible values for this header, but the one that controls expiration is max-age=foo.
In addition, IE may be revalidating. This means that IE includes some extra information in the request that tells the web server what version of the resource it has in its cache. If the browser's cached version is current, your server can respond with 304 Not Modified and NOT include the content in the response. "Conditional GET requests" include this versioning information. It's possible that your server is giving 304 responses when it shouldn't be.
There are two sets of headers that control revalidation:
Last-Modified + If-Modified-Since
ETag + If-None-Match
Last-Modified and ETag are response headers that tell the browser the version of the resource it is about to receive. If you don't want browsers to revalidate, don't set these. If-Modified-Since and If-None-Match are the corresponding request headers the browser uses to report the version of a stale resource that it needs to revalidate with the server.
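To make the revalidation exchange concrete, here is a client-side sketch using the requests library purely for illustration (the URL is a placeholder, not part of the original question):

import requests

# First fetch: the server may include an ETag (and/or Last-Modified) header.
first = requests.get('http://example.com/json')
etag = first.headers.get('ETag')

# Revalidation: send the version back; if the resource is unchanged the
# server answers 304 with no body instead of resending the JSON.
headers = {'If-None-Match': etag} if etag else {}
second = requests.get('http://example.com/json', headers=headers)
print(second.status_code)  # 304 if not modified, 200 with a fresh body otherwise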
There are various tools to see what HTTP headers your server is sending back to the browser. One is the Firefox extension Live HTTP Headers. Another tool, which Steve Souders recommends, is IBM Page Detailer. I haven't tried this one myself, but it doesn't depend on the browser you're using.

This is a common problem -- IE caches all ajax/json requests on the client side. Other browsers do not.
To work around it, generate a random number and append it to your request URL as a query parameter. This fools IE into thinking it's a new request.
Here's an example in JavaScript; you can do something similar in Python (see the sketch after this snippet):
function rand() {
    return Math.floor(Math.random() * 100000);
}
$("#content").load("/posts/view/1?rand=" + rand());

The jQuery library has pretty nice AJAX functions, and settings to control them. One of them is called "cache", and it will automatically append a random number to the query string that essentially forces the browser not to cache the page. This can be set along with the parameter "dataType", which can be set to "json" to make the AJAX request fetch JSON data. I've been using this in my code and haven't had a problem with IE.
Hope this helps

Unable to get complete source code of web page using Python [duplicate]

I would like to send requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem and have tried different approaches, but they all fail, while every other website is fine. Any suggestions?
Web servers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client consistently gets a different response, try to figure out what the differences are between the request Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to an http://httpbin.org endpoint, have it record the request, and then experiment.
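For example, httpbin's /headers endpoint simply echoes back the request headers it received, which makes it easy to diff against what your browser sends. A quick sketch:

import requests

# httpbin.org/headers returns {"headers": {...}} describing the request it saw.
echo = requests.get('https://httpbin.org/headers')
print(echo.json()['headers'])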
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host: this must be set to the hostname you are contacting, so that the server can properly serve multiple sites from one host. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supplying credentials the same way the browser did).
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent; it looks like they are blacklisting Python, and setting it to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only a HTTP client, a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
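If you do need script execution, a requests-html session looks roughly like this (a sketch; the first render() call downloads a headless Chromium build, and the custom User-Agent is carried over from the fix above):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
r.html.render()           # runs the page's JavaScript in headless Chromium
print(r.html.html[:500])  # the rendered HTML, scripts included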
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1, take that into account if you are trying to scrape data from this site.
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
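A hedged sketch of that GET-before-POST flow with a shared session; the URLs, the csrf_token field name, and the credentials are placeholders, not this site's actual API:

import re
import requests

session = requests.Session()
session.headers['User-Agent'] = 'Custom'  # avoid the user-agent blacklist

# GET the form first so the session picks up cookies and the CSRF token.
page = session.get('https://example.com/login')
match = re.search(r'name="csrf_token" value="([^"]+)"', page.text)
token = match.group(1) if match else ''

# POST back with the token; the session carries the cookies automatically.
resp = session.post('https://example.com/login',
                    data={'csrf_token': token, 'user': 'me', 'password': 'secret'})
print(resp.status_code)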
Last but not least, if a site is blocking scripts from making requests, it is probably either because they are trying to enforce terms of service that prohibit scraping, or because they have an API they would rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
with open("filepath", 'w') as file:
links = file.read().splitlines()
for link in links:
response = requests.get(link)
In my case this was due to the fact that the website address had recently changed, and I was given the old address. At least this changed the status code from 404 to 500, which, I think, is progress :)

Django Rest Framework: redirect to Amazon S3 fails when using Token Authentication

I'm using token authentication in DRF and for a certain API call, want to redirect to S3 (using a URL like https://my_bucket.s3.amazonaws.com/my/file/path/my_file.jpg?Signature=MY_AWS_SIGNATURE&AWSAccessKeyId=MY_AWS_ACCESS_KEY_ID). However, I get the following error from AWS:
<Error>
<Code>InvalidArgument</Code>
<Message>Only one auth mechanism allowed; only the X-Amz-Algorithm query parameter, Signature query string parameter or the Authorization header should be specified</Message>
<ArgumentName>Authorization</ArgumentName>
<ArgumentValue>Token a3f61c10592272399099882eb178bd4b755af5bf</ArgumentValue>
<RequestId>E4038228DD1E6330</RequestId>
<HostId>9c2xX59cugrR0CHjxQJR8IBE4MXBbNMX+wX2JdPJEuerkAftc32rufotM7COKLIavakByuRUXOo=</HostId>
</Error>
It's clear why this happens--the Authorization header with DRF's token is propagated with the redirect and S3 doesn't like it.
After researching and trying a million ways to get rid of that header, I gave up and decided to try and override the header with an S3 value: AWS MY_AWS_ACCESS_KEY_ID:MY_AWS_SIGNATURE, after which I get a different error:
<Error>
<Code>InvalidArgument</Code>
<Message>Unsupported Authorization Type</Message>
<ArgumentName>Authorization</ArgumentName>
<ArgumentValue>Token a3f61c10592272399099882eb178bd4b755af5bf</ArgumentValue>
<RequestId>94D5ADA28C6A5BFB</RequestId>
<HostId>1YznL6UC3V0+nCvilsriHDAnP2/h3MoDlIJ/L+0V6w7nbHbf2bSxoQflujGmQ5PrUZpNiH7GywI=</HostId>
</Error>
As you can see, the end result is the same--even if I override the Authorization header in my response, it still keeps the original DRF token authentication value.
# relevant portion of my response construction
headers = {'Location': 'https://my_bucket.s3.amazonaws.com/my/file/path/my_file.jpg',
           'Authorization': 'AWS %s:%s' % (params['AWSAccessKeyId'], params['Signature'])}
return Response(status=status.HTTP_302_FOUND, headers=headers)
So, my question is, how can the Authorization header in a DRF response be either removed or overridden?
Forwarding the Authorization header on a redirect is the responsibility of the client (e.g. browsers, cURL, HTTP libraries/toolkits).
For example, Paw, the toolkit I use to query my APIs, offers that kind of configuration.
So basically, major browsers tend to forward the Authorization header on redirects, which causes the conflict with S3.
Also I suspect you misunderstood how redirects are performed:
When DRF issues a redirect, it returns an HTTP 301 or 302 response to the client, containing the new Location header (the request is not "forwarded" directly via DRF)
Then, the client requests this new URI
And finally, you're not overriding any Authorization header when you emit your 302, since that header is on the response to the client (a response can carry an Authorization header, but it's useless there).
Right now, you have a bunch of solutions (though none of them out of the box):
Passing your token through a different header to avoid the conflict (X-Token for example)
Passing your token through a HTTP GET parameter (?token=blah)
Using your DRF view to proxy the S3 object (no redirect then)
The first two solutions may break the consistency of your API somewhat, but they are fair enough. They would require a custom TokenAuthentication class or a custom get_authorization_header (from rest_framework.authentication).
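For instance, a minimal sketch of the first option, assuming DRF's built-in token model and a client that sends X-Token instead of Authorization (the class name is hypothetical):

# Register this in DEFAULT_AUTHENTICATION_CLASSES (or on the view)
# in place of the stock TokenAuthentication.
from rest_framework.authentication import TokenAuthentication

class XTokenAuthentication(TokenAuthentication):
    def authenticate(self, request):
        token = request.META.get('HTTP_X_TOKEN')  # the X-Token request header
        if not token:
            return None                           # fall through to other authenticators
        return self.authenticate_credentials(token)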
The last one is transparent but may be totally unsuitable depending on the object you're fetching from S3 and/or your hosting constraints...
That's all I can tell you for now. As you know, I've been stuck with the same situation too. I would be so pleased if anyone could suggest a better solution.
I was using the ModHeader Google Chrome extension to provide the OAuth2 token on each API request (Django REST framework browsable API). But it also includes the Authorization header when making requests to S3 static resources (js, css and others).
The preferred way is to use Postman and turn off Authorization header forwarding (I think that is the default; the API worked perfectly for me this way). But we lose the browsable API convenience.
Another option is the Requestly Google Chrome extension. With it we can prevent the Authorization token from being forwarded when the browser requests another URL (like https://.s3.amazonaws.com/).
Configure Requestly with a condition (for example, an IP and port) that must match before the Authorization header is added to the request. That way URLs like the S3 one do not give an error and the static data is served properly.

Odd redirect location causes proxy error with urllib2

I am using urllib2 to do an http post request using Python 2.7.3. My request is returning an HTTPError exception (HTTP Error 502: Proxy Error).
Looking at the messages traffic with Charles, I see the following is happening:
I send the HTTP request (POST /index.asp?action=login HTTP/1.1) using urllib2
The remote server replies with status 303 and a location header of ../index.asp?action=news
urllib2 follows the redirect by sending a GET request: (GET /../index.asp?action=news HTTP/1.1)
The remote server replies with status 502 (Proxy error)
The 502 reply includes this in the response body: "DNS lookup failure for: 10.0.0.30:80index.asp" (Notice the malformed URL)
So I take this to mean that a proxy server on the remote server's network sees the "/../index.asp" URL in the request and misinterprets it, sending my request on with a bad URL.
When I make the same request with my browser (Chrome), the retry is sent to GET /index.asp?action=news. So Chrome takes off the leading "/.." from the URL, and the remote server replies with a valid response.
Is this a urllib2 bug? Is there something I can do so the retry ignores the "/.." in the URL? Or is there some other way to solve this problem? Thinking it might be a urllib2 bug, I swapped out urllib2 with requests but requests produced the same result. Of course, that may be because requests is built on urllib2.
Thanks for any help.
The Location being sent with that 303 is wrong in multiple ways.
First, if you read RFC 2616 (HTTP/1.1 Header Field Definitions), 14.30 Location, the Location must be an absoluteURI, not a relative one. And section 10.3.4 (303 See Other) makes it clear that this is the relevant definition.
Second, even if a relative URI were allowed, RFC 1808, Relative Uniform Resource Locators, 4. Resolving Relative URLs, step 6, only specifies special handling for .. in the pattern <segment>/../. That means that a relative URL shouldn't start with ... So, even if the base URL is http://example.com/foo/bar/ and the relative URL is ../baz/, the resolved URL is not http://example.com/foo/baz/, but http://example.com/foo/bar/../baz. (Of course most servers will treat these the same way, but that's up to each server.)
Finally, even if you did combine the relative and base URLs before resolving .., an absolute URI with a path starting with .. is invalid.
So, the bug is in the server's configuration.
Now, it just so happens that many user-agents will work around this bug. In particular, they turn /../foo into /foo to block users (or arbitrary JS running on their behalf without their knowledge) from trying to do "escape from webroot" attacks.
But that doesn't mean that urllib2 should do so, or that it's buggy for not doing so. Of course urllib2 should detect the error earlier so it can tell you "invalid path" or something, instead of running together an illegal absolute URI that's going to confuse the server into sending you back nonsense errors. But it is right to fail.
It's all well and good to say that the server configuration is wrong, but unless you're the one in charge of the server, you'll probably face an uphill battle trying to convince them that their site is broken and needs to be fixed when it works with every web browser they care about. Which means you may need to write your own workaround to deal with their site.
The way to do that with urllib2 is to supply your own HTTPRedirectHandler with an implementation of redirect_request method that recognizes this case and returns a different Request than the default code would (in particular, http://example.com/index.asp?action=news instead of http://example.com/../index.asp?action=news).
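A minimal sketch of such a handler for Python 2's urllib2, assuming the only defect is a leading "/.." in the redirected path (the host and POST body are placeholders):

import urllib2
import urlparse

class CollapseLeadingDotDot(urllib2.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Strip any leading "/../" segments produced by the server's bad Location.
        parts = urlparse.urlsplit(newurl)
        path = parts.path
        while path.startswith('/../'):
            path = path[3:]
        fixed = urlparse.urlunsplit(
            (parts.scheme, parts.netloc, path, parts.query, parts.fragment))
        return urllib2.HTTPRedirectHandler.redirect_request(
            self, req, fp, code, msg, headers, fixed)

opener = urllib2.build_opener(CollapseLeadingDotDot)
response = opener.open('http://example.com/index.asp?action=login',
                       data='user=me&password=secret')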

Does urllib2.urlopen() cache stuff?

They didn't mention this in the Python documentation. Recently I've been testing a website, simply refreshing it with urllib2.urlopen() to extract certain content, and I notice that sometimes when I update the site, urllib2.urlopen() doesn't seem to get the newly added content. So I wonder, does it cache stuff somewhere?
So I wonder, does it cache stuff somewhere?
It doesn't.
If you don't see new data, this could have many reasons. Most bigger web services use server-side caching for performance reasons, for example caching proxies like Varnish and Squid, or application-level caching.
If the problem is caused by server-side caching, there is usually no way to force the server to give you the latest data.
For caching proxies like Squid, things are different. Usually, Squid adds some additional headers to the HTTP response (see response.info().headers).
If you see a header field called X-Cache or X-Cache-Lookup, this means that you aren't connected to the remote server directly, but through a transparent proxy.
If you have something like X-Cache: HIT from proxy.domain.tld, this means that the response you got is cached. The opposite is X-Cache: MISS from proxy.domain.tld, which means that the response is fresh.
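A quick way to check for those headers from Python 2's urllib2 (the URL is a placeholder):

import urllib2

response = urllib2.urlopen('http://example.com/')
print response.info().getheader('X-Cache')         # e.g. "HIT from proxy.domain.tld"
print response.info().getheader('X-Cache-Lookup')  # None if no transparent proxy is involved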
Very old question, but I had a similar problem which this solution did not resolve.
In my case I had to spoof the User-Agent like this:
import urllib2

request = urllib2.Request(url)
request.add_header('User-Agent', 'Mozilla/5.0')
content = urllib2.build_opener().open(request)
Hope this helps anyone...
Your web server or an HTTP proxy may be caching content. You can try to disable caching by adding a Pragma: no-cache request header:
import urllib2

request = urllib2.Request(url)
request.add_header('Pragma', 'no-cache')
content = urllib2.build_opener().open(request)
If you make changes and test the behaviour from the browser and from urllib2, it is easy to make a silly mistake.
In the browser you are logged in, but with urllib2.urlopen() your app may be redirected to the same login page every time, so if you just look at the page size or the top of your common layout, you could think that your changes have no effect.
I find it hard to believe that urllib2 does not do caching, because in my case, upon restart of the program the data is refreshed. If the program is not restarted, the data appears to be cached forever. Also retrieving the same data from Firefox never returns stale data.

Can we only get the web page header information and not the body? (Mechanize)

What if I only need to download the page if it has not changed since the last download?
What is the best way? Can I get the size of the page first, then compare to decide whether it has changed, and if so download it, otherwise skip it?
I plan to use (Python) mechanize.
The request should be a HEAD, not a GET:
9.4 HEAD
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.
The response to a HEAD request MAY be cacheable in the sense that the information contained in the response MAY be used to update a previously cached entity from that resource. If the new field values indicate that the cached entity differs from the current entity (as would be indicated by a change in Content-Length, Content-MD5, ETag or Last-Modified), then the cache MUST treat the cache entry as stale.
See here: How can I perform a HEAD request with the mechanize library?
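In short, the trick from that question is to override the request's method before opening it. A sketch, assuming mechanize's urllib2-style API (the URL is a placeholder):

import mechanize

request = mechanize.Request('http://example.com/')
request.get_method = lambda: 'HEAD'   # make this request a HEAD instead of a GET
response = mechanize.urlopen(request)
print response.info()                 # headers only; no body is transferred
print response.info().getheader('Last-Modified')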
Yes, you can get more information in Python mechanize by turning on debugging like this:
import mechanize

br = mechanize.Browser()
br.set_debug_http(True)
br.set_debug_redirects(True)
# ... your code here ...
By doing this, you can see valuable header information for the page.
