I wish to "retrieve" the cookies sent by the client in my subclass of BaseHTTPRequestHandler.
Firstly, I'm unsure of the exact sequence in which headers are sent. In a typical HTTP request and response, this is my understanding of the sequence of events:
Client sends request (method, path, HTTP version, host, and ALL headers).
The server responds with a response code, followed by a bunch of headers of its own.
The server then sends the body of the response.
When exactly is the client's POST data sent? Does any overlap occur in the sequence described above?
Secondly, when is it safe to assume that the "Cookie" header has been received by the server? Should all of the client's headers have been received by the time self.send_response is called by the server? When in the HTTP exchange is the appropriate time to peek at cookie headers in self.headers?
Thirdly, what is the canonical way to parse cookies in Python? I currently believe a Cookie.SimpleCookie should be instantiated, and then data from the cookie headers somehow fed into it. Further clouding this problem is the Cookie class's clunkiness when dealing with the BaseHTTPRequestHandler interfaces. Why does the output from Cookie.output() not end with a line terminator, to fit into self.wfile.write(cookie.output()), or instead drop the implicitly provided header name, to fit nicely into self.send_header("Set-Cookie", cookie.output())?
Finally, the cookie classes in the Cookie module give the illusion that they're dictionaries of dictionaries. Assigning to different keys in the cookie does not pack more data into the cookie, but rather generates more cookies... all apparently in the one class, and each generating its own Set-Cookie header. What is the best practice for packing containers of values into cookie(s)?
HTTP is a request/response protocol, without overlap; the body of a POST comes as part of the request (when the verb is POST, of course).
All headers also come as part of the request, including Cookie: if any (there might be no such header of course, e.g. when the browser is running with cookies disabled or whatever). So peek at the headers whenever you've received the request and are serving it.
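For instance, here is a minimal sketch (Python 2 module names, matching the question; the handler class and echo behaviour are made up) of reading the Cookie header and the POST body once the request has arrived:

from BaseHTTPServer import BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # By the time do_POST runs, every request header has been received.
        cookie_header = self.headers.get('Cookie')  # empty when no cookies were sent
        length = int(self.headers.get('Content-Length', 0))
        body = self.rfile.read(length)  # the POST body is part of the request itself
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)  # echo the body back, just for illustration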
I'm not sure what your "thirdly" problem is. No newline gets inserted if none is part of the cookie -- why ever should it be? Edit: see later.
On the fourth point, I think you may be confusing cookies with "morsels". There is no limit to the number of Set-Cookie headers in the HTTP response, so why's that a problem?
Edit: you can optionally pass to output up to three arguments: the set of morsel attributes you want in the output for each morsel (default None, meaning all attributes), the header string you want in front of each morsel (default Set-Cookie:), and the separator string you want between morsels (default \r\n). So it seems that your intended use of a cookie is single-morsel (otherwise you couldn't stick the string representation into a single header, which you appear most keen to do): in that case
thecookie.output(None, '')
will give you exactly the string you want. Just make multiple SimpleCookie instances with one morsel each (since one morsel is what fits into a single header!-).
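To illustrate with a minimal sketch (the handler argument stands in for your BaseHTTPRequestHandler instance): each SimpleCookie holds one morsel, and each morsel goes out in its own Set-Cookie header. Note that output(None, '') leaves a leading space where the header name used to be, which lstrip() removes:

import Cookie

def send_cookies(handler):
    for name, value in [('session', 'abc123'), ('theme', 'dark')]:
        c = Cookie.SimpleCookie()
        c[name] = value
        # output(None, '') drops the 'Set-Cookie:' prefix, leaving ' name=value'
        handler.send_header('Set-Cookie', c.output(None, '').lstrip())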
Here's a quick way to get the cookies with no third-party libraries. While it only answers a section of the question, it may be the section most visitors are after.
import Cookie

def do_GET(self):
    cookies = {}
    # self.headers.get('Cookie') is falsy when the client sent no cookies
    cookies_string = self.headers.get('Cookie')
    if cookies_string:
        cookies = Cookie.SimpleCookie()
        cookies.load(cookies_string)
    if 'my-cookie' in cookies:
        print(cookies['my-cookie'].value)
I would like to send requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem and have tried different approaches, but I still fail.
All other websites are fine.
Any suggestions?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
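For example, httpbin echoes back the request it received, which makes the comparison straightforward:

import requests

# httpbin.org/get returns a JSON description of the request it saw,
# so you can compare what requests sent with what your browser sends.
resp = requests.get('https://httpbin.org/get',
                    headers={'User-Agent': 'my-script/0.1'})
print(resp.json()['headers'])  # the headers as the server saw them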
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host: this must be set to the hostname you are contacting, so that the server can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supplying credentials the same way the browser did); see the sketch after this list.
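A minimal sketch of that cookie-capturing point (the URLs and form fields are placeholders):

import requests

with requests.Session() as s:
    # The session stores any cookies the server sets here...
    s.get('https://example.com/')
    # ...and sends them back automatically on every later request.
    s.post('https://example.com/login',
           data={'user': 'me', 'password': 'secret'})
    profile = s.get('https://example.com/profile')  # now authenticated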
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent. It looks like they are blacklisting Python; setting the User-Agent to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only an HTTP client; a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
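For instance, with requests-html (a sketch, assuming the package is installed; the first render() call downloads a Chromium build):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
r.html.render()  # executes the page's JavaScript in headless Chromium
print(r.html.find('title', first=True).text)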
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1; take that into account if you are trying to scrape data from this site.
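A sketch of hitting that endpoint directly, with the query string from above broken out into params and the same User-Agent workaround:

import requests

api = 'https://rent.591.com.tw/home/search/rsList'
params = {'is_new_list': 1, 'type': 1, 'kind': 0, 'searchtype': 1, 'region': 1}
resp = requests.get(api, params=params, headers={'User-Agent': 'Custom'})
print(resp.status_code)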
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
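A sketch of that order of operations (the URLs and the csrf_token field name are assumptions; adapt them to the site):

import re
import requests

with requests.Session() as s:
    # GET the form first: the server sets its cookies and issues the token.
    page = s.get('https://example.com/form')
    match = re.search(r'name="csrf_token" value="([^"]+)"', page.text)
    # POST the form back with the token the server expects.
    s.post('https://example.com/submit',
           data={'csrf_token': match.group(1), 'field': 'value'})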
Last but not least, if a site is blocking scripts from making requests, they are probably either trying to enforce terms of service that prohibit scraping, or they have an API they'd rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some web scraping on links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
with open("filepath", 'w') as file:
links = file.read().splitlines()
for link in links:
response = requests.get(link)
In my case this was due to the fact that the website address had recently changed, and I had been given the old website address. At least this changed the status code from 404 to 500, which, I think, is progress :)
I want to write a twisted proxy that splits up very large GET requests into smaller fixed-size ranges and sends them on to another proxy (using Range: bytes headers). The other proxy doesn't allow large responses, and when the response is too large it returns a 502.
How can I implement a proxy in twisted that, on a 502 error, retries the request split into smaller allowed chunks? The documentation is hard to follow. I know I need to extend ProxyRequest, but from there I'm a bit stuck.
It doesn't have to be a twisted proxy, but twisted seems to be easily modified, and I managed to at least get it to forward the request unmodified to the proxy by just setting the connectTCP to my proxy (in ProxyRequest.parsed).
Extending ProxyRequest is probably not the easiest way to do this, actually; ProxyRequest pretty strongly assumes that one request = one response, whereas here you want to split up a single request into multiple requests.
Easier would be to simply write a Resource implementation that does what you want, which briefly would be:
in render_GET, construct a URL to make several outgoing requests using Agent
return NOT_DONE_YET
as each response comes in, call request.write on your original incoming request, and then issue a new request with a Range header
finally when the last response comes in, call request.finish on your original request
You can simply construct a Site object with your Resource, and set isLeaf on your Resource to True so your Resource doesn't have to implement any traversal logic and can just build the URL using request.prePathURL and request.postpath. (request.postpath is sadly undocumented; it's a list of the not-yet-traversed path segments in the request.)
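A rough sketch of that shape (the chunk size, upstream URL, and end-of-content check are assumptions, and the 502-retry logic the question asks about is left out for brevity):

from twisted.internet import reactor
from twisted.web import server
from twisted.web.client import Agent, readBody
from twisted.web.http_headers import Headers
from twisted.web.resource import Resource

CHUNK = 1024 * 1024  # hypothetical per-request range size

class RangeSplitter(Resource):
    isLeaf = True

    def render_GET(self, request):
        self._fetch(request, offset=0)
        return server.NOT_DONE_YET

    def _fetch(self, request, offset):
        agent = Agent(reactor)
        rng = 'bytes=%d-%d' % (offset, offset + CHUNK - 1)
        d = agent.request(b'GET', b'http://upstream.example/big',
                          Headers({b'Range': [rng.encode('ascii')]}))
        d.addCallback(readBody)
        d.addCallback(self._deliver, request, offset)

    def _deliver(self, body, request, offset):
        request.write(body)
        if len(body) == CHUNK:   # full chunk: assume there is more to fetch
            self._fetch(request, offset + CHUNK)
        else:                    # short chunk: we reached the end of the resource
            request.finish()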
Complexities aside, what is the simplest quick-and-dirty way to detect in a request whether that request was issued by a CLI program, such as curl, or by a browser? Here is what I'm trying to figure out:
def view(request):
if request.is_from_browser:
return HTML_TEMPLATE
else:
return JSON
Request.is_ajax() checks if the HTTP_X_REQUESTED_WITH header equals XMLHttpRequest. This is becoming an "industry standard" among web frameworks/libraries to separate Ajax calls from normal requests. But it depends on cooperation from the client side to actually set the header. There's no 100% foolproof way of detecting browser, client, Ajax etc. without this cooperation.
Btw, why do you need to know what's calling?
Something in the HTTP request headers: I'd first try the Accept header. With the Accept header, the client can specify what sort of content it wants; this puts the responsibility on the client.
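A minimal Django-style sketch of that (approximating the question's is_from_browser idea with a deliberately crude 'text/html' check; the template name is a placeholder):

from django.http import JsonResponse
from django.shortcuts import render

def view(request):
    accept = request.META.get('HTTP_ACCEPT', '')
    if 'text/html' in accept:      # browsers ask for HTML first
        return render(request, 'page.html')
    return JsonResponse({'data': 'value'})  # curl and friends get JSON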
For a REST api, I know it is acceptable to pass data in a POST call:
if method == 'POST':
    r = requests.post(url, headers=headers, data=body)
Is it acceptable to pass data in a PUT or DELETE call? Or are you not supposed to send any data parms, and only request the specified url?
RFC 7231 explains everything you need to know.
PUT is very similar to POST in a REST API.
POST
The POST method requests that the target resource process the representation enclosed in the request according to the resource's own specific semantics. RFC 7231 #4.3.3
PUT
The PUT method requests that the state of the target resource be created or replaced with the state defined by the representation enclosed in the request message payload. RFC 7231 #4.3.4
Both carry data in the request. Furthermore, the RFC explicitly highlights the difference, since it is indeed slight:
The fundamental difference between the POST and PUT methods is highlighted by the different intent for the enclosed representation. The target resource in a POST request is intended to handle the enclosed representation according to the resource's own semantics, whereas the enclosed representation in a PUT request is defined as replacing the state of the target resource. Hence, the intent of PUT is idempotent and visible to intermediaries, even though the exact effect is only known by the origin server.
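In requests terms (a sketch; the URLs and payload are made up), both verbs carry a body the same way:

import requests

# POST: the server processes the representation according to its own semantics
requests.post('https://api.example.com/items', json={'name': 'widget'})

# PUT: the representation replaces the state of the target resource
requests.put('https://api.example.com/items/42', json={'name': 'widget'})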
Regarding DELETE, the RFC says the following:
A payload within a DELETE request message has no defined semantics; sending a payload body on a DELETE request might cause some existing implementations to reject the request.
I presume that means that you shouldn't send data. Regardless, the RFC does mention
Relatively few resources allow the DELETE method
which is spot on, in my opinion. You should just avoid DELETE altogether.
I am serving some JSON content from a Google App Engine server. I need to serve an ETag for the content, to know whether it has changed since I last loaded the data. Then my app will remove its old data and use the new JSON data to populate its views.
self.response.headers['Content-Type'] = "application/json; charset=utf-8"
self.response.out.write(json.dumps(to_dict(objects,"content")))
What's the best practice for setting ETags on the response? Do I have to calculate the ETag myself, or is there a way to get the HTTP layer to do this?
If you're using webapp2, it can add an md5 ETag based on the response body automatically.
self.response.md5_etag()
http://webapp-improved.appspot.com/guide/response.html
You'll have to calculate the e-tag value yourself. E-tags are opaque strings that only have meaning to the application.
Best practice is to just concatenate all the input variables (converted to string) that determine the JSON content; anything that, if changed, would result in a change in the JSON output, should be part of this. If there is anything sensitive in those strings you wouldn't want to be exposed, use the MD5 hash of the values instead.
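A minimal sketch of that recipe (the helper name is made up):

import hashlib

def compute_etag(*parts):
    # Join every input that influences the JSON output into one opaque string,
    # then hash it so nothing sensitive leaks into the header.
    raw = '|'.join(str(p) for p in parts)
    return hashlib.md5(raw.encode('utf-8')).hexdigest()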
For example, in a CMS application that I administer, the front page has the following e-tag:
|531337735|en-us;en;q=0.5|0|Eli Visual Theme|1|943ed3c25e6d44497deb3fe274f98a96||
The input variables that we care about have been concatenated with the | symbol into an opaque value, but it does represent several distinct input values, such as a last-modified timestamp (the number), the browser's accepted-language header, the current visual theme, and an internal UID that is retrieved from a browser cookie (and which determines what context the content on the front page is taken from). If any of those variables were to change, the page is likely to be different and the cached copy would be stale.
Note that an e-tag is useless without a means to verify it quickly. A client will include it in a If-None-Match request header, and the server should be able to quickly re-calculate the e-tag header from the current variables and see if the tag is still current. If that re-calculation would take the same amount of time as regenerating the content body, you only save a little bandwidth sending the 304 Not Modified response instead of a full JSON body in a 200 OK response.
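In the App Engine handler from the question, that verification might look like this (a sketch: compute_etag and its inputs are the hypothetical helper and placeholders from above):

# Hypothetical inputs; reuse compute_etag() from the sketch above.
etag = '"%s"' % compute_etag(last_modified, language, theme, uid)  # ETags are quoted strings
if self.request.headers.get('If-None-Match') == etag:
    self.response.set_status(304)  # client's copy is still current; no body needed
else:
    self.response.headers['ETag'] = etag
    self.response.headers['Content-Type'] = "application/json; charset=utf-8"
    self.response.out.write(json.dumps(to_dict(objects, "content")))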