Python tornado AsyncHTTPClient fluke

Python tornado AsyncHTTPClient fluke - python

I have a problem here:
import tornado.httpclient
from tornado.httpclient import AsyncHTTPClient
AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient")
# inside a function
client = AsyncHTTPClient()
result = yield client.fetch('http://some-site.com/#hash?&key=value', raise_error=False)
print(result.effective_url) # prints: http://some.site/some/path/
Note that key-values go after hash. Some site that I scrape gives redirects like this. If I comment out "AsyncHTTPClient.configure('tornado.curl_httpclient.CurlAsyncHTTPClient')" that all works fine, but I cannot use proxy to intersect and view the HTTP exchanges. And with this line staff after hash disappears... Can Anyone tell me why?

Everything after the # is called the "fragment", and it is not normally sent to the server. Instead, it is made available for the browser and javascript to use. At the level of HTTP, http://some-site.com/#hash?&key=value is equivalent to http://some-site.com/. AsyncHTTPClient should be stripping off the fragment whether you use curl or not; the difference you're seeing here is probably a bug.

You want to pass #fragment part. It is used by browsers to navigate through anchors on page or client side routing (more info rfc3985 3.5).
Fragment is not send to the server by browser. Also libcurl do not send the fragment part, as doc says:
While space is not typically a "legal" letter, libcurl accepts them.
When a user wants to pass in a '#' (hash) character it will be treated
as a fragment and get cut off by libcurl if provided literally. You
will instead have to escape it by providing it as backslash and its
ASCII value in hexadecimal: "\23".
You could replace # with %23 as well, but the server should know how to handle it, more likely it does not since it is the part handled by a browser.

Related

Unable to get complete source code of web page using Python [duplicate]

I would like to try send requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I knew this is a common problem and tried different way but still failed.
but all of other website is ok.
any suggestion?

Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host; this must be set to the hostname you are contacting, so that it can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supplied credentials the same way the browser did).
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent, it looks like they are blacklisting Python, setting it to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only a HTTP client, a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1, take that into account if you are trying to scrape data from this site.
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
Last but not least, if a site is blocking scripts from making requests, they probably are either trying to enforce terms of service that prohibit scraping, or because they have an API they rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.

One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
with open("filepath", 'w') as file:
links = file.read().splitlines()
for link in links:
response = requests.get(link)

In my case this was due to fact that the website address was recently changed, and I was provided the old website address. At least this changed the status code from 404 to 500, which, I think, is progress :)

Django URL pattern to include a #

I'm having issues getting a URL pattern to work.
The URL is in the format of the following:
/API#access_token=<string>&expires_in=<timestamp>
I can't change the #access_token=&expires_in= part unfortunately, as this is outside of my control, and I simply have to just make my side of the code work.
I've tried a number of different patterns, a number of which are outlined below. This is my first Django project, and any advice, and pointers would be much appreciated.
url(r'^API#access_token=(?P<token_info>\w+)&expires_in(?P<time>\d+)$'
url(r'^API#(?P<tokens>\w+)$'
url(r'^API/#(?P<tokens>\w+)&(?P<expiration>\d+)$'

The issue is that the anchor #, also called the fragment identifier, is not sent to the server by the browser. The regex cannot capture what is not there. From the wikipedia article on the fragment identifier:
The fragment identifier functions differently than the rest of the
URI: namely, its processing is exclusively client-side with no
participation from the web server — of course the server typically
helps to determine the MIME type, and the MIME type determines the
processing of fragments. When an agent (such as a Web browser)
requests a web resource from a Web server, the agent sends the URI to
the server, but does not send the fragment. Instead, the agent waits
for the server to send the resource, and then the agent processes the
resource according to the document type and fragment value.
The only way around this is to parse the fragment in JavaScript on the client side and send it as a separate asynchronous request. For a GET request, you could send the fragment as a query parameter (after stripping off the hash) or put it in the header as a custom value.

Flask Redirect URLs escaping

I'm trying to redirect to a Graphite URL with Flask. The graphite URLs I'm building are complex and must include the literal characters {, }, and |. Flask is escaping them to %7B %7C
and %7D.
Is there any way I can stop this? On the graphite side, I want a target that looks like this: sumSeries({metric|metric|metric})
#app.route("/")
def index():
instances = get_data()
url = build_graphite_url(instances)
print url
return redirect(url)

If you dig into the Flask source you will eventually run into a function called get_wsgi_headers in wrappers.py under werkzeug: See here.
This function is called when the final response is created and returned, and if you scroll down a little you will find that it checks to see if a location header was set and if so, does some auto correction to make sure the url is absolute. During this time, it needs to escape the url, which is why your URL is escaped.
From the best of my knowledge, the only way to prevent this is to patch get_wsgi_headers so that it will basically not escape certain characters, since after all Flask is open source :)
Also as a side note, the reason why you cannot listen for the after_request callback and modify the response headers is because werkzeug's get_wsgi_headers is called after the callback, so whatever Location you set in the callback will end up being escaped as well.

Inconsistent decoding of encoded URL query parameters using Python and GAE

I'm trying to get consistent URL strings in my mobile client before submitting, and on the server once received, to be able to reliably add a hash for security checksum purposes. Currently I'm adding the hash after URL-encoding on the client, and attempting to grab the URL before anything gets decoded on the server, but I'm getting one character (a period) already decoded:
When I post something like this:
https://myapp.appspot.com/endpt?par=0%3Afirstlast%40gmail%2Ecom&di . . .
From this on the server:
self.request.url
I get:
https://myapp.appspot.com/endpt?par=0%3Afirstlast%40gmail.com&di . . .
And from this:
self.request.get('par')
I get it completely decoded as I would expect:
0:firstlast#gmail.com
I'm wondering how I can grab the URL before ANY decoding happens? Or alternately, I could do my hashing outside of the encoding/decoding if it's possible to grab the URL with the entire query portion decoded? I.e. I can inject my hash at any point that I can get consistent, reliable results. Thanks.

You could fetch this directly from your WSGI environment, but I'd suggest taking a page out of Amazon's book instead, and defining a canonical format for signing URLs. Then you can encode and format the URLs in the same way on both ends, and you don't have to rely on the vagaries of frameworks and proxies not to interfere with the trivial encoding details of your URLs.

redirect browser in SimpleHTTPServer.py?

I am partially through implementing the functionality of SimpleHTTPServer.py in Scheme. I am having some good fun with HTTP request/response mechanism. While going through the above file, I came across this- " # redirect browser - doing basically what apache does" in the code".
Why is this redirection necessary in such a scenario?

Imagine you serve a page
http://mydomain.com/bla
that contains
Read more...
On click, the user's browser would retrieve http://mydomain.com/more.html. Had you instead served
http://mydomain.com/bla/
(with the same content), the browser would retrieve http://mydomain.com/bla/more.html. To avoid this ambiguity, the redirection appends a slash if the URL points to a directory.

It simplifies things to treat the trailing / as irrelevant when the user does a GET on a directory, so that (say) http://www.foo.com/bar and http://www.foo.com/bar/ have exactly the same effect. Simplest (though not fastest, see Souders' books;-) is to have the former cause a redirect to the latter.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.